Consistency of the Group Lasso 
and Multiple Kernel Learning 
Abstract

We consider the least-square regression problem with regularization by a block $\ell_1$-norm, i.e., a
sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as
the group Lasso, extends the usual regularization by the $\ell_1$-norm where all spaces have dimension
one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic model
consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency 
of group Lasso under practical assumptions, such as model misspecification. When the linear 
predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, 
the problem is usually referred to as multiple kernel learning and is commonly used for learning 
from heterogeneous data sources and for non linear variable selection. Using tools from functional 
analysis, and in particular covariance operators, we extend the consistency results to this infinite 
dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even 
when the necessary condition required for the non adaptive scheme is not satisfied. 
Keywords: Sparsity, regularization, consistency, convex optimization, covariance operators 

1. Introduction 

Regularization has emerged as a dominant theme in machine learning and statistics. It provides an 
intuitive and principled tool for learning from high-dimensional data. Regularization by squared 
Euclidean norms or squared Hilbertian norms has been thoroughly studied in various settings, from 
approximation theory to statistics, leading to efficient practical algorithms based on linear algebra
and very general theoretical consistency results (Tikhonov and Arsenin, 1997, Wahba, 1990,
Hastie et al., 2001, Steinwart, 2001, Cucker and Smale, 2002).



In recent years, regularization by non Hilbertian norms has generated considerable interest in
linear supervised learning, where the goal is to predict a response as a linear function of covariates;
in particular, regularization by the $\ell_1$-norm (equal to the sum of absolute values), a method
commonly referred to as the Lasso (Tibshirani, 1994, Osborne et al., 2000), makes it possible to perform
variable selection. However, regularization by non Hilbertian norms cannot be solved empirically by
simple linear algebra and instead leads to general convex optimization problems, and much of the early
effort has been dedicated to algorithms to solve the optimization problem efficiently. In particular,
the Lars algorithm of Efron et al. (2004) finds the entire regularization path (i.e., the set of
solutions for all values of the regularization parameters) at the cost of a single matrix inversion.

As a consequence of the optimality conditions, regularization by the $\ell_1$-norm leads to sparse
solutions, i.e., loading vectors with many zeros. Recent works (Zhao and Yu, 2006, Yuan and Lin,
2007, Zou, 2006, Wainwright, 2006) have looked precisely at the model consistency of the Lasso,
i.e., if we know that the data were generated from a sparse loading vector, does the Lasso actually
recover it when the number of observed data points grows? In the case of a fixed number of
covariates, the Lasso does recover the sparsity pattern if and only if a certain simple condition on the
generating covariance matrices is verified (Yuan and Lin, 2007). In particular, in low correlation
settings, the Lasso is indeed consistent. However, in the presence of strong correlations, the Lasso
cannot be consistent, shedding light on potential problems of such procedures for variable selection.
Adaptive versions where data-dependent weights are added to the $\ell_1$-norm then make it possible to
keep the consistency in all situations (Zou, 2006).

A related Lasso-type procedure is the group Lasso, where the covariates are assumed to be 
clustered in groups, and instead of summing the absolute values of each individual loading, the 
sum of Euclidean norms of the loadings in each group is used. Intuitively, this should drive all the
weights in one group to zero together, and thus lead to group selection (Yuan and Lin, 2006). In
Section 2, we extend the consistency results of the Lasso to the group Lasso, showing that similar
correlation conditions are necessary and sufficient conditions for consistency. The passage from
groups of size one to groups of larger sizes leads however to a slightly weaker result as we cannot
get a single necessary and sufficient condition (in Section 2.4, we show that the stronger result
similar to the Lasso is not true as soon as one group has dimension larger than one). Also, in our 
proofs, we relax the assumptions usually made for such consistency results, i.e., that the model is 
completely well-specified (conditional expectation of the response which is linear in the covariates 
and constant conditional variance). In the context of misspecification, which is a common situation 
when applying methods such as the ones presented in this paper, we simply prove convergence 
to the best linear predictor (which is assumed to be sparse), both in terms of loading vectors and 
sparsity patterns. 

The group Lasso essentially replaces groups of size one by groups of size larger than one. It 
is natural in this context to allow the size of each group to grow unbounded, i.e., to replace the 
sum of Euclidean norms by a sum of appropriate Hilbertian norms. When the Hilbert spaces are 
reproducing kernel Hilbert spaces (RKHS), this procedure turns out to be equivalent to learning the
best convex combination of a set of basis kernels, where each kernel corresponds to one Hilbertian
norm used for regularization (Bach et al., 2004a). This framework, referred to as multiple
kernel learning (Bach et al., 2004a), has applications in kernel selection, data fusion from
heterogeneous data sources and non linear variable selection (Lanckriet et al., 2004a). In this latter
case, multiple kernel learning can exactly be seen as variable selection in a generalized additive
model (Hastie and Tibshirani, 1990). We extend the consistency results of the group Lasso to this
non parametric case, by using covariance operators and appropriate notions of functional analysis.
These notions allow us to carry out the analysis entirely in "primal/input" space, while the algorithm
has to work in "dual/feature" space to avoid infinite dimensional optimization. Throughout the 
paper, we will always go back and forth between primal and dual formulations, primal formulation 
for analysis and dual formulation for algorithms. 

The paper is organized as follows: in Section 2, we present the consistency results for the group
Lasso, while in Section 3, we extend these to Hilbert spaces. Finally, we present the adaptive
schemes in Section 4 and illustrate our set of results with simulations on synthetic examples in
Section 5.



2. Consistency of the Group Lasso 



We consider the problem of predicting a response $Y \in \mathbb{R}$ from covariates $X \in \mathbb{R}^p$, where $X$ has
a block structure with $m$ blocks, i.e., $X = (X_1^\top, \dots, X_m^\top)^\top$ with each $X_j \in \mathbb{R}^{p_j}$, $j = 1, \dots, m$,
and $\sum_{j=1}^m p_j = p$. Unless otherwise specified, $\|X\|$ will denote the Euclidean norm of a vector $X$.



The only assumptions that we make on the joint distribution $P_{XY}$ of $(X, Y)$ are the following:
(A1) $X$ and $Y$ have finite fourth order moments: $E\|X\|^4 < \infty$ and $E\|Y\|^4 < \infty$.
(A2) The joint covariance matrix $\Sigma_{XX} = E XX^\top - (EX)(EX)^\top \in \mathbb{R}^{p \times p}$ is invertible.
(A3) We let $(\mathbf{w}, \mathbf{b}) \in \mathbb{R}^p \times \mathbb{R}$ denote any minimizer of $E(Y - X^\top w - b)^2$. We assume that
$E((Y - \mathbf{w}^\top X - \mathbf{b})^2 \mid X)$ is almost surely greater than $\sigma_{\min}^2 > 0$. We denote by
$\mathbf{J} = \{j, \mathbf{w}_j \neq 0\}$ the sparsity pattern of $\mathbf{w}$.

The assumption (A3) does not state that $E(Y|X)$ is an affine function of $X$ and that the conditional
variance is constant, as is commonly done in most works dealing with consistency for linear
supervised learning. We simply assume that given the best affine predictor of $Y$ given $X$ (defined
by $\mathbf{w} \in \mathbb{R}^p$ and $\mathbf{b} \in \mathbb{R}$), there is still a strictly positive amount of variance in $Y$. If (A2) is
satisfied, then the full loading vector $\mathbf{w}$ is uniquely defined and is equal to $\mathbf{w} = (\Sigma_{XX})^{-1} \Sigma_{XY}$,
where $\Sigma_{XY} = E(XY) - (EX)(EY) \in \mathbb{R}^p$. Note that throughout this paper, we do include a non
regularized constant term $b$ but since we use a square loss it will be optimized out in closed form by
centering the data. Thus all our consistency statements will be stated only for the loading vector $\mathbf{w}$;
corresponding results for $\mathbf{b}$ then immediately follow.
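The closed-form elimination of the constant term can be checked numerically. The following is a minimal sketch on synthetic data, where the function name `centered_least_squares` and the simulated design are illustrative assumptions and not part of the paper: it computes the loading vector from centered empirical covariances and compares it with ordinary least squares fitted with an explicit column of ones.

```python
import numpy as np

def centered_least_squares(X, Y):
    """Minimize the average of (Y - X w - b)^2 by centering the data first."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)           # center covariates
    Yc = Y - Y.mean()                 # center response
    S_xx = Xc.T @ Xc / n              # empirical covariance of X
    S_xy = Xc.T @ Yc / n              # empirical covariance between X and Y
    w = np.linalg.solve(S_xx, S_xy)   # w = S_xx^{-1} S_xy
    b = Y.mean() - X.mean(axis=0) @ w # intercept recovered in closed form
    return w, b

# Synthetic data (assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) + 3.0
Y = X @ np.array([1.0, 0.0, -2.0, 0.0]) + 0.5 + 0.1 * rng.normal(size=200)
w, b = centered_least_squares(X, Y)

# Reference solution: least squares with an explicit intercept column.
A = np.hstack([X, np.ones((200, 1))])
ref = np.linalg.lstsq(A, Y, rcond=None)[0]
```

Both routes should agree up to numerical precision, which is why the consistency statements can be restricted to the loading vector.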

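We often use the notation $\varepsilon = Y - \mathbf{w}^\top X - \mathbf{b}$. In terms of covariance matrices, our assumption
(A3) leads to: $\Sigma_{\varepsilon\varepsilon|X} = E(\varepsilon\varepsilon \mid X) \geq \sigma_{\min}^2$ and $\Sigma_{\varepsilon X} = 0$ (but $\varepsilon$ might not in general be independent
from $X$).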



Applications of grouped variables In this paper, we assume that the grouping of the univariate
variables is known and fixed, i.e., the group structure is given and we wish to achieve sparsity at the
level of groups. This has numerous applications, e.g., in speech and signal processing, where groups
may represent different frequency bands (McAuley et al., 2005), or bioinformatics (Lanckriet et al.,
2004a) and computer vision (Varma and Ray, 2007, Harchaoui and Bach, 2007) where each group
may correspond to different data sources or data types. Note that those different data sources are
sometimes referred to as views (see, e.g., Zhou and Burges, 2007).

Moreover, we always assume that the number $m$ of groups is fixed and finite. Considering cases
where $m$ is allowed to grow with the number of observed data points, in the line of Meinshausen and Yu
(2006), is outside the scope of this paper.



Notations Throughout this paper, we consider the block covariance matrix $\Sigma_{XX}$ with blocks
$\Sigma_{X_i X_j}$, $i, j = 1, \dots, m$. We refer to the submatrix composed of all blocks indexed by sets $I$, $J$ as
$\Sigma_{X_I X_J}$. Similarly, our loadings are vectors defined following block structure, $w = (w_1^\top, \dots, w_m^\top)^\top$,
and we denote $w_I$ the elements indexed by $I$. Moreover we denote $1_q$ the vector in $\mathbb{R}^q$ with constant
components equal to one, and $I_q$ the identity matrix of size $q$.

2.1 Group Lasso 

We consider independent and identically distributed (i.i.d.) data $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$, $i = 1, \dots, n$,
sampled from $P_{XY}$ and the data are given in the form of matrices $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$ and
we write $X = (X_1, \dots, X_m)$ where each $X_j \in \mathbb{R}^{n \times p_j}$ represents the data associated with group $j$.

1. Note that throughout this paper, we use boldface fonts for population quantities. 
Throughout this paper, we make the same i.i.d. assumption; dealing with non identically distributed
or dependent data and extending our results in those situations are left for future research. 
We consider the following optimization problem: 



$$\min_{w \in \mathbb{R}^p,\ b \in \mathbb{R}} \ \frac{1}{2n} \|Y - Xw - b 1_n\|^2 + \lambda_n \sum_{j=1}^m d_j \|w_j\|,$$

where $d \in \mathbb{R}^m$ is a vector of strictly positive fixed weights. Note that considering weights in
the block $\ell_1$-norm is important in practice as those have an influence regarding the consistency of
the estimator (see Section 4 for further details). Since $b$ is not regularized, we can minimize in
closed form with respect to $b$, by setting $b = \frac{1}{n} 1_n^\top (Y - Xw)$. This leads to the following reduced
optimization problem in $w$:



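$$\min_{w \in \mathbb{R}^p} \ \frac{1}{2} \hat{\Sigma}_{YY} - \hat{\Sigma}_{XY}^\top w + \frac{1}{2} w^\top \hat{\Sigma}_{XX} w + \lambda_n \sum_{j=1}^m d_j \|w_j\|, \qquad (1)$$

where $\hat{\Sigma}_{YY} = \frac{1}{n} Y^\top \Pi_n Y$, $\hat{\Sigma}_{XY} = \frac{1}{n} X^\top \Pi_n Y$ and $\hat{\Sigma}_{XX} = \frac{1}{n} X^\top \Pi_n X$ are empirical covariance
matrices (with the centering matrix $\Pi_n$ defined as $\Pi_n = I_n - \frac{1}{n} 1_n 1_n^\top$). We denote $\hat{w}$ any minimizer
of Eq. (1). We refer to $\hat{w}$ as the group Lasso estimate. Note that with probability tending to one, if
(A2) is satisfied (i.e., if $\Sigma_{XX}$ is invertible), there is a unique minimum.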

Problem (1) is a non-differentiable convex optimization problem, for which classical tools from
convex optimization (Boyd and Vandenberghe, 2003) lead to the following optimality conditions
(see proof by Yuan and Lin (2006) and in Appendix A.1):



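Proposition 1 A vector $w \in \mathbb{R}^p$ with sparsity pattern $J = J(w) = \{j, w_j \neq 0\}$ is optimal for
problem (1) if and only if

$$\forall j \in J^c, \quad \| \hat{\Sigma}_{X_j X} w - \hat{\Sigma}_{X_j Y} \| \leq \lambda_n d_j, \qquad (2)$$

$$\forall j \in J, \quad \hat{\Sigma}_{X_j X} w - \hat{\Sigma}_{X_j Y} = - w_j \frac{\lambda_n d_j}{\|w_j\|}. \qquad (3)$$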
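The two optimality conditions say that, for groups with zero loadings, the corresponding block of the gradient of the quadratic term has norm at most $\lambda_n d_j$, while for active groups it equals $-\lambda_n d_j w_j / \|w_j\|$. As a minimal sketch, these conditions can be verified numerically for a candidate vector. The function name and the toy inputs below are illustrative assumptions; the example uses an identity covariance, for which the minimizer is known to be a blockwise shrinkage of the covariance with the response.

```python
import numpy as np

def check_group_lasso_optimality(S_xx, S_xy, w, groups, d, lam, tol=1e-8):
    """Return True if w satisfies the group Lasso optimality conditions."""
    grad = S_xx @ w - S_xy  # gradient of the smooth quadratic part
    for g, dj in zip(groups, d):
        norm_wg = np.linalg.norm(w[g])
        if norm_wg == 0.0:
            # Inactive group: gradient block must lie in a ball of radius lam * d_j.
            if np.linalg.norm(grad[g]) > lam * dj + tol:
                return False
        else:
            # Active group: gradient block must equal -lam * d_j * w_j / ||w_j||.
            if np.linalg.norm(grad[g] + lam * dj * w[g] / norm_wg) > tol:
                return False
    return True

# Toy problem with identity covariance (assumption for illustration).
groups = [np.array([0, 1]), np.array([2, 3])]
d = np.array([1.0, 1.0])
lam = 0.5
S_xx = np.eye(4)
S_xy = np.array([3.0, 4.0, 0.3, 0.0])  # block norms are 5 and 0.3

# With S_xx = I, the minimizer shrinks each block of S_xy toward zero.
w_opt = np.zeros(4)
for g, dj in zip(groups, d):
    norm_g = np.linalg.norm(S_xy[g])
    w_opt[g] = max(0.0, 1.0 - lam * dj / norm_g) * S_xy[g]

is_opt = check_group_lasso_optimality(S_xx, S_xy, w_opt, groups, d, lam)
is_opt_unregularized = check_group_lasso_optimality(S_xx, S_xy, S_xy, groups, d, lam)
```

In this toy case the second group is set exactly to zero because its block norm (0.3) is below $\lambda_n d_j$ (0.5), while the unregularized solution fails the check.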



2.2 Algorithms 

Efficient exact algorithms exist for the regular Lasso, i.e., for the case where all group dimensions
$p_j$ are equal to one. They are based on the piecewise linearity of the set of solutions as a function
of the regularization parameter $\lambda_n$ (Efron et al., 2004). For the group Lasso, however, the
path is only piecewise differentiable, and following such a path is not as efficient as for the Lasso.
Other algorithms have been designed to solve problem (1) for a single value of $\lambda_n$, in the original
group Lasso setting (Yuan and Lin, 2006) and in the multiple kernel setting (Bach et al., 2004a,b,
Sonnenburg et al., 2006, Rakotomamonjy et al., 2007). In this paper, we study path consistency of
the group Lasso and of multiple kernel learning, and in simulations we use the publicly available
code for the algorithm of Bach et al. (2004b), that computes an approximate but entire path, by
following the piecewise smooth path with predictor-corrector methods.
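For readers who want to experiment with a single value of the regularization parameter, the following is a minimal sketch of a generic solver. It is not the predictor-corrector path-following algorithm used in this paper: it swaps in plain proximal gradient descent with block soft-thresholding, a standard technique for block-norm penalties. The function names, the iteration count and the synthetic data are illustrative assumptions.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2: shrink the whole block toward zero."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v

def group_lasso_prox_grad(X, Y, groups, d, lam, n_iter=500):
    """Minimize the reduced group Lasso objective by proximal gradient descent.

    `groups` is a list of index arrays that partition the columns of X.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    S_xx = Xc.T @ Xc / n
    S_xy = Xc.T @ Yc / n
    step = 1.0 / np.linalg.eigvalsh(S_xx)[-1]  # 1 / Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        z = w - step * (S_xx @ w - S_xy)       # gradient step on the quadratic part
        for g, dj in zip(groups, d):
            w[g] = group_soft_threshold(z[g], step * lam * dj)
    b = Y.mean() - X.mean(axis=0) @ w          # unregularized constant term
    return w, b

# Synthetic example (assumption): only the first group is active.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 6))
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
w_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
Y = X @ w_true + 0.1 * rng.normal(size=n)
w_hat, b_hat = group_lasso_prox_grad(X, Y, groups, d=np.ones(3), lam=0.2)
```

Because the thresholding step sets whole blocks exactly to zero, the estimated sparsity pattern can be read off directly from the zero blocks of `w_hat`.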



2. We use the convention that all "hat" notations correspond to data-dependent and thus n-dependent quantities, so we 
do not need the explicit dependence on n. 
2.3 Consistency Results

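We consider the following two conditions:

$$\max_{i \in \mathbf{J}^c} \frac{1}{d_i} \left\| \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{w}_j\|) \mathbf{w}_{\mathbf{J}} \right\| < 1, \qquad (4)$$

$$\max_{i \in \mathbf{J}^c} \frac{1}{d_i} \left\| \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{w}_j\|) \mathbf{w}_{\mathbf{J}} \right\| \leq 1, \qquad (5)$$

where $\mathrm{Diag}(d_j / \|\mathbf{w}_j\|)$ denotes the block-diagonal matrix (with block sizes $p_j$) in which each
diagonal block is equal to $\frac{d_j}{\|\mathbf{w}_j\|} I_{p_j}$ (with $I_{p_j}$ the identity matrix of size $p_j$), and $\mathbf{w}_{\mathbf{J}}$ denotes the
concatenation of the loading vectors indexed by $\mathbf{J}$. Note that the conditions involve the covariance
between all active groups $X_j$, $j \in \mathbf{J}$ and all non active groups $X_i$, $i \in \mathbf{J}^c$.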
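The quantity compared to one in the two conditions is the maximum, over non active groups $i$, of $\frac{1}{d_i}$ times the norm of the covariance block between group $i$ and the active groups, multiplied by the inverse covariance of the active groups and by the weighted normalized loadings. A minimal sketch of its computation follows; the function names, the convention of returning 0 when there is no active or no inactive group, and the block-correlated toy covariance are illustrative assumptions.

```python
import numpy as np

def consistency_condition_value(S_xx, w, groups, d):
    """Left-hand side of the strict/weak conditions for covariance S_xx and loadings w."""
    active = [k for k, g in enumerate(groups) if np.linalg.norm(w[g]) > 0]
    inactive = [k for k in range(len(groups)) if k not in active]
    if not active or not inactive:
        return 0.0  # convention used in this sketch: nothing to compare
    J = np.concatenate([groups[k] for k in active])
    # Diag(d_j / ||w_j||) w_J: each active block is normalized and scaled by d_j.
    scaled = np.concatenate(
        [d[k] * w[groups[k]] / np.linalg.norm(w[groups[k]]) for k in active])
    v = np.linalg.solve(S_xx[np.ix_(J, J)], scaled)
    return max(np.linalg.norm(S_xx[np.ix_(groups[i], J)] @ v) / d[i]
               for i in inactive)

def block_corr_cov(rho):
    """Two groups of size 2 with identity within-group covariance and cross-covariance rho * I."""
    I2 = np.eye(2)
    return np.block([[I2, rho * I2], [rho * I2, I2]])

groups = [np.array([0, 1]), np.array([2, 3])]
w = np.array([1.0, 1.0, 0.0, 0.0])  # only the first group is active
low = consistency_condition_value(block_corr_cov(0.3), w, groups, np.array([1.0, 1.0]))
high = consistency_condition_value(block_corr_cov(0.6), w, groups, np.array([2.0, 1.0]))
```

In this toy design the value is the cross-correlation times the ratio of weights, so the first setting satisfies the strict condition while the second, with a larger weight on the active group, violates even the weak one.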

These are conditions on both the input (through the joint covariance matrix $\Sigma_{XX}$) and on the
weight vector $\mathbf{w}$. Note that, when all blocks have size 1, this corresponds to the conditions derived
for the Lasso (Zhao and Yu, 2006, Yuan and Lin, 2007, Zou, 2006). Note also the difference between
the strong condition (4) and the weak condition (5). For the Lasso, with our assumptions,
Yuan and Lin (2007) have shown that the strong condition (4) is necessary and sufficient for path
consistency of the Lasso; i.e., the path of solutions consistently contains an estimate which is both
consistent for the $\ell_2$-norm (regular consistency) and the $\ell_0$-norm (consistency of patterns), if and
only if condition (4) is satisfied.

In the case of the group Lasso, even with a finite fixed number of groups, our results are not as
strong, as we can only get the strict condition as sufficient and the weak condition as necessary. In
Section 2.4, we show that this cannot be improved in general. More precisely the following theorem,
proved in Appendix B.1, shows that if the condition (4) is satisfied, any regularization parameter
that satisfies certain decay conditions will lead to a consistent estimator; thus the strong condition
(4) is sufficient for path consistency:

Theorem 2 Assume (A1-3). If condition (4) is satisfied, then for any sequence $\lambda_n$ such that $\lambda_n \to 0$
and $\lambda_n n^{1/2} \to +\infty$, the group Lasso estimate $\hat{w}$ defined in Eq. (1) converges in probability
to $\mathbf{w}$ and the group sparsity pattern $J(\hat{w}) = \{j, \hat{w}_j \neq 0\}$ converges in probability to $\mathbf{J}$ (i.e.,
$P(J(\hat{w}) = \mathbf{J}) \to 1$).



The following theorem, proved in Appendix B.2, states that if there is a consistent solution on
the path, then the weak condition (5) must be satisfied.

Theorem 3 Assume (A1-3). If there exists a (possibly data-dependent) sequence $\lambda_n$ such that $\hat{w}$
converges to $\mathbf{w}$ and $J(\hat{w})$ converges to $\mathbf{J}$ in probability, then condition (5) is satisfied.

On the one hand, Theorem 2 states that under the "low correlation between variables in $\mathbf{J}$ and
variables in $\mathbf{J}^c$" condition (4), the group Lasso is indeed consistent. On the other hand, the
result (and the similar one for the Lasso) is rather disappointing regarding the applicability of the
group Lasso as a practical group selection method, as Theorem 3 states that if the weak correlation
condition (5) is not satisfied, we cannot have consistency.

Moreover, this is to be contrasted with a thresholding procedure of the joint least-square
estimator, which is also consistent with no conditions (but the invertibility of $\Sigma_{XX}$), if the threshold is
properly chosen (smaller than the smallest norm $\|\mathbf{w}_j\|$ for $j \in \mathbf{J}$ or with appropriate decay
conditions). However, the Lasso and group Lasso do not have to set such a threshold; moreover, further
analysis shows that the Lasso has additional advantages over regular regularized least-square
procedures (Meinshausen and Yu, 2006), and empirical evidence shows that in the finite sample case,
they do perform better (Tibshirani, 1994), in particular in the case where the number $m$ of groups
is allowed to grow. In this paper we focus on the extension from uni-dimensional groups to
multi-dimensional groups for a finite number of groups $m$ and leave the possibility of letting $m$ grow with
$n$ for future research.

Finally, by looking carefully at conditions (4) and (5), we can see that if we were to increase
the weights $d_j$ for $j \in \mathbf{J}^c$ and decrease the weights otherwise, we could always be consistent: this
however requires the (potentially empirical) knowledge of $\mathbf{J}$ and this is exactly the idea behind the
adaptive scheme that we present in Section 4. Before looking at these extensions, we discuss in the
next section qualitative differences between our results and the corresponding ones for the Lasso.



2.4 Refinements of Consistency Conditions 

Our current results state that the strict condition (4) is sufficient for joint consistency of the group
Lasso, while the weak condition (5) is only necessary. When all groups have dimension one, then
the strict condition turns out to be also necessary (Yuan and Lin, 2007).

The main technical reason for those differences is that in dimension one, the set of vectors
of unit norm is finite (two possible values), and thus regular squared norm consistency leads to
estimates of the signs of the loadings (i.e., their normalized versions $\hat{w}_j / \|\hat{w}_j\|$) which are ultimately
constant. When groups have size larger than one, then $\hat{w}_j / \|\hat{w}_j\|$ will not be ultimately constant (just
consistent) and this added dependence on data leads to the following refinement of Theorem 2 (see
proof in Appendix B.3):

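Theorem 4 Assume (A1-3). Assume the weak condition (5) is satisfied and that for all $i \in \mathbf{J}^c$ such
that $\frac{1}{d_i} \left\| \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{w}_j\|) \mathbf{w}_{\mathbf{J}} \right\| = 1$, we have

$$\Delta^\top \Sigma_{X_{\mathbf{J}} X_i} \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}\left[ \frac{d_j}{\|\mathbf{w}_j\|} \left( I_{p_j} - \frac{\mathbf{w}_j \mathbf{w}_j^\top}{\mathbf{w}_j^\top \mathbf{w}_j} \right) \right] \Delta > 0, \qquad (6)$$

with $\Delta = - \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{w}_j\|) \mathbf{w}_{\mathbf{J}}$. Then for any sequence $\lambda_n$ such that $\lambda_n \to 0$ and
$\lambda_n n^{1/4} \to +\infty$, the group Lasso estimate $\hat{w}$ defined in Eq. (1) converges in probability to $\mathbf{w}$ and the group
sparsity pattern $J(\hat{w}) = \{j, \hat{w}_j \neq 0\}$ converges in probability to $\mathbf{J}$.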

This theorem is of lower practical significance than Theorem 2 and Theorem 3. It merely shows
that the link between strict/weak conditions and sufficient/necessary conditions is in a sense tight
(as soon as there exists $j \in \mathbf{J}$ such that $p_j > 1$, it is easy to exhibit examples where Eq. (6) is or is
not satisfied). The previous theorem does not contradict the fact that condition (4) is necessary for
path-consistency in the Lasso case: indeed, if $\mathbf{w}_j$ has dimension one, then $I_{p_j} - \frac{\mathbf{w}_j \mathbf{w}_j^\top}{\mathbf{w}_j^\top \mathbf{w}_j}$ is always
equal to zero, and thus Eq. (6) is never satisfied. Note that when condition (6) is an equality, we
could still refine the condition by using higher orders in the asymptotic expansions presented in
Appendix B.3.

We can also further refine the necessary condition results in Theorem 3: as stated in Theorem 3,
the group Lasso estimator may be both consistent in terms of norm and sparsity patterns only if the
condition (5) is satisfied. However, if we require only consistent sparsity pattern estimation,
then we may allow the convergence of the regularization parameter $\lambda_n$ to a strictly positive limit $\lambda_0$.
In this situation, we may consider the following population problem:



$$\min_{w \in \mathbb{R}^p} \ \frac{1}{2} (w - \mathbf{w})^\top \Sigma_{XX} (w - \mathbf{w}) + \lambda_0 \sum_{j=1}^m d_j \|w_j\|. \qquad (7)$$

If there exists $\lambda_0 > 0$ such that the solution has the correct sparsity pattern, then the group Lasso
estimate with $\lambda_n \to \lambda_0$ will have a consistent sparsity pattern. The following proposition, which
can be proved with standard M-estimation arguments, makes this precise:

Proposition 5 Assume (A1-3). If $\lambda_n$ tends to $\lambda_0 > 0$, then the group Lasso estimate $\hat{w}$ is
sparsity-consistent if and only if the solution of Eq. (7) has the correct sparsity pattern.

Thus, even when condition (5) is not satisfied, we may have consistent estimation of the sparsity
pattern but inconsistent estimation of the loading vectors. We provide such examples in Section 5.

2.5 Probability of Correct Pattern Selection 

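In this section, we focus on regularization parameters that tend to zero, at the rate $n^{-1/2}$, i.e.,
$\lambda_n = \lambda_0 n^{-1/2}$ with $\lambda_0 > 0$. For this particular setting, we can actually compute the limit of the
probability of correct pattern selection (proposition proved in Appendix B.4). Note that in order to
obtain a simpler result, we assume constant conditional variance of $Y$ given $\mathbf{w}^\top X$:

Proposition 6 Assume (A1-3) and $\mathrm{var}(Y \mid \mathbf{w}^\top X) = \sigma^2$ almost surely. Assume moreover $\lambda_n =
\lambda_0 n^{-1/2}$ with $\lambda_0 > 0$. Then, the group Lasso estimate $\hat{w}$ converges in probability to $\mathbf{w}$ and the probability
of correct sparsity pattern selection has the following limit:

$$P\left( \max_{i \in \mathbf{J}^c} \frac{1}{d_i} \left\| \frac{\sigma}{\lambda_0} t_i - \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{w}_j\|) \mathbf{w}_{\mathbf{J}} \right\| \leq 1 \right), \qquad (8)$$

where $t$ is normally distributed with mean zero and covariance matrix $\Sigma_{X_{\mathbf{J}^c} X_{\mathbf{J}^c} | X_{\mathbf{J}}} = \Sigma_{X_{\mathbf{J}^c} X_{\mathbf{J}^c}} -
\Sigma_{X_{\mathbf{J}^c} X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}^c}}$ (which is the conditional covariance matrix of $X_{\mathbf{J}^c}$ given $X_{\mathbf{J}}$).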

The previous proposition states that the probability of correct selection tends to the mass under a non
degenerate multivariate distribution of the intersection of cylinders. Under our assumptions, this
set is never empty and thus the limiting probability is strictly positive, i.e., there is (asymptotically)
always a positive probability of estimating the correct pattern of groups.

Moreover, additional insights may be gained from Proposition 6, namely in terms of the
dependence on $\sigma$, $\lambda_0$ and the tightness of the consistency conditions. First, when $\lambda_0$ tends to infinity, then
the limit defined in Eq. (8) tends to one if the strict consistency condition (4) is satisfied, and tends
to zero if one of the conditions is strictly not met. This corroborates the results of Theorems 2 and 3.
Note however, that only an extension of Proposition 6 to $\lambda_n$ that may deviate from $\lambda_0 n^{-1/2}$ would
actually lead to a proof of Theorem 2, which is a subject of ongoing research.

Finally, Eq. (8) shows that $\sigma$ has a smoothing effect on the probability of correct pattern
selection, i.e., if condition (4) is satisfied, then this probability is a decreasing function of $\sigma$ (and an
increasing function of $\lambda_0$). Finally, the stricter the inequality in Eq. (4), the larger the probability of
correct pattern selection, which is illustrated in Section 5 on synthetic examples.
2.6 Loading Independent Sufficient Condition

Condition (4) depends on the loading vector $\mathbf{w}$ and on the sparsity pattern $\mathbf{J}$, which are both a priori
unknown. In this section, we consider sufficient conditions that do not depend on the loading vector, 
but only on the sparsity pattern J and of course on the covariance matrices. The following condition 
is sufficient for consistency of the group Lasso, for all possible loading vectors w with sparsity 
pattern J: 



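$$C(\Sigma_{XX}, d, \mathbf{J}) = \max_{i \in \mathbf{J}^c} \ \max_{\forall j \in \mathbf{J},\ \|u_j\| = 1} \ \frac{1}{d_i} \left\| \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j) u_{\mathbf{J}} \right\| < 1. \qquad (9)$$

As opposed to the Lasso case, $C(\Sigma_{XX}, d, \mathbf{J})$ cannot be readily computed in closed form, but
we have the following upper bound:

$$C(\Sigma_{XX}, d, \mathbf{J}) \leq \max_{i \in \mathbf{J}^c} \frac{1}{d_i} \sum_{j \in \mathbf{J}} d_j \left\| \sum_{k \in \mathbf{J}} \Sigma_{X_i X_k} \left( \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \right)_{kj} \right\|,$$

where for a matrix $M$, $\|M\|$ denotes its maximal singular value (also known as its spectral norm).
This leads to the following sufficient condition for consistency of the group Lasso (which extends
the condition of Yuan and Lin, 2007):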



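$$\max_{i \in \mathbf{J}^c} \frac{1}{d_i} \sum_{j \in \mathbf{J}} d_j \left\| \sum_{k \in \mathbf{J}} \Sigma_{X_i X_k} \left( \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \right)_{kj} \right\| < 1. \qquad (10)$$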
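The bound behind this sufficient condition follows from the triangle inequality applied block by block: for unit-norm $u_j$, the norm of $\Sigma_{X_i X_J} \Sigma_{X_J X_J}^{-1} \mathrm{Diag}(d_j) u_J$ is at most the sum over active groups of $d_j$ times the spectral norm of the corresponding column block. The sketch below, whose function names, random covariance matrix and sample size are illustrative assumptions, computes this upper bound and compares it with a Monte Carlo lower bound on the loading-independent quantity obtained from random unit-norm blocks.

```python
import numpy as np

def _blocks(S_xx, groups, active):
    """Return {i: S_{iJ} S_{JJ}^{-1}} for inactive groups i, and the slices of active groups in J."""
    J = np.concatenate([groups[k] for k in active])
    S_JJ_inv = np.linalg.inv(S_xx[np.ix_(J, J)])
    offsets = np.concatenate([[0], np.cumsum([len(groups[k]) for k in active])])
    slices = [slice(offsets[a], offsets[a + 1]) for a in range(len(active))]
    B = {i: S_xx[np.ix_(groups[i], J)] @ S_JJ_inv
         for i in range(len(groups)) if i not in active}
    return B, slices

def spectral_norm_bound(S_xx, groups, active, d):
    """Upper bound: max_i (1/d_i) sum_j d_j * (spectral norm of the j-th column block)."""
    B, slices = _blocks(S_xx, groups, active)
    return max(sum(d[k] * np.linalg.norm(Bi[:, s], 2) for k, s in zip(active, slices)) / d[i]
               for i, Bi in B.items())

def sampled_lower_bound(S_xx, groups, active, d, n_samples=2000, seed=0):
    """Monte Carlo lower bound on the maximum over unit-norm blocks u_j."""
    rng = np.random.default_rng(seed)
    B, slices = _blocks(S_xx, groups, active)
    best = 0.0
    for _ in range(n_samples):
        u = np.zeros(slices[-1].stop)
        for k, s in zip(active, slices):
            v = rng.normal(size=s.stop - s.start)
            u[s] = d[k] * v / np.linalg.norm(v)  # Diag(d_j) applied to a unit-norm block
        best = max(best, max(np.linalg.norm(Bi @ u) / d[i] for i, Bi in B.items()))
    return best

# Random positive definite covariance (assumption for illustration).
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
S_xx = A @ A.T / 6 + np.eye(6)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
d = np.ones(3)
active = [0, 1]
upper = spectral_norm_bound(S_xx, groups, active, d)
lower = sampled_lower_bound(S_xx, groups, active, d)
```

The sampled value can only underestimate the true maximum, so it should never exceed the spectral-norm bound; the gap between the two gives a rough idea of how loose the bound is for a given covariance structure.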



Given a set of weights $d$, better sufficient conditions than Eq. (10) may be obtained by solving a
semidefinite programming problem (Boyd and Vandenberghe, 2003):



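Proposition 7 The quantity $\max_{\forall j \in \mathbf{J},\ \|u_j\| = 1} \left\| \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j) u_{\mathbf{J}} \right\|^2$ is upperbounded by

$$\max_{M \succeq 0,\ \mathrm{tr}\, M_{jj} = 1} \ \mathrm{tr}\, M \left( \mathrm{Diag}(d_j) \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \Sigma_{X_{\mathbf{J}} X_i} \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j) \right), \qquad (11)$$

where $M$ is a matrix defined by blocks following the block structure of $\Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}$. Moreover, the bound
is also equal to

$$\min_{\lambda \in \mathbb{R}^m,\ \mathrm{Diag}(d_j) \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \Sigma_{X_{\mathbf{J}} X_i} \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j) \preceq \mathrm{Diag}(\lambda)} \ \sum_{j \in \mathbf{J}} \lambda_j.$$

Proof We denote $M = u u^\top \succeq 0$. Then if all $u_j$ for $j \in \mathbf{J}$ have norm 1, we have $\mathrm{tr}\, M_{jj} = 1$
for all $j \in \mathbf{J}$. This implies the convex relaxation. The second problem is easily obtained as the
convex dual of the first problem (Boyd and Vandenberghe, 2003). ■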

Note that for the Lasso, the convex bound in Eq. (11) is tight and leads to the bound given above
in Eq. (10) (Yuan and Lin, 2007, Wainwright, 2006). For the Lasso, Zhao and Yu (2006) consider
several particular patterns of dependencies using Eq. (10). Note that this condition (and not the
condition in Eq. (4)) is independent from the dimension and thus does not readily lead to rules of
thumb for setting the weight $d_j$ as a function of the dimension $p_j$; several rules of thumb have
been suggested, that loosely depend on the dimension of the blocks, in the context of the linear
group Lasso (Yuan and Lin, 2006) or multiple kernel learning (Bach et al., 2004b); we argue in this
paper that weights should also depend on the response (see Section 4).
2.7 Alternative Formulation of the Group Lasso



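Following Bach et al. (2004a), we can instead consider regularization by the square of the block
$\ell_1$-norm:

$$\min_{w \in \mathbb{R}^p,\ b \in \mathbb{R}} \ \frac{1}{2n} \|Y - Xw - b 1_n\|^2 + \frac{1}{2} \mu_n \left( \sum_{j=1}^m d_j \|w_j\| \right)^2.$$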



This leads to the same path of solutions, but it is better behaved because each variable which is
not zero is still regularized by the squared norm. The alternative version also has two
advantages: (a) it has very close links to more general frameworks for learning the kernel matrix from
data (Lanckriet et al., 2004b), and (b) it is essential in our proof of consistency in the functional
case. We also get the equivalent formulation to Eq. (1), by minimizing in closed form with respect
to $b$, to obtain:

$$\min_{w \in \mathbb{R}^p} \ \frac{1}{2} \hat{\Sigma}_{YY} - \hat{\Sigma}_{YX} w + \frac{1}{2} w^\top \hat{\Sigma}_{XX} w + \frac{1}{2} \mu_n \left( \sum_{j=1}^m d_j \|w_j\| \right)^2. \qquad (12)$$



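The following proposition gives the optimality conditions for the convex optimization problem
defined in Eq. (12) (see proof in Appendix A.2):

Proposition 8 A vector $w \in \mathbb{R}^p$ with sparsity pattern $J = \{j, w_j \neq 0\}$ is optimal for problem (12)
if and only if

$$\forall j \in J^c, \quad \| \hat{\Sigma}_{X_j X} w - \hat{\Sigma}_{X_j Y} \| \leq \mu_n d_j \left( \sum_{i=1}^m d_i \|w_i\| \right), \qquad (13)$$

$$\forall j \in J, \quad \hat{\Sigma}_{X_j X} w - \hat{\Sigma}_{X_j Y} = - \mu_n \left( \sum_{i=1}^m d_i \|w_i\| \right) \frac{d_j w_j}{\|w_j\|}. \qquad (14)$$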



Note the correspondence at the optimum between optimal solutions of the two optimization
problems in Eq. (1) and Eq. (12) through $\lambda_n = \mu_n \left( \sum_{i=1}^m d_i \|\hat{w}_i\| \right)$. As far as consistency results are
concerned, Theorem 3 immediately applies to the alternative formulation because the regularization
paths are the same. Theorem 2 does not readily apply. But since the relationship between
$\lambda_n$ and $\mu_n$ at optimum is $\lambda_n = \mu_n \left( \sum_{i=1}^m d_i \|\hat{w}_i\| \right)$ and $\sum_{i=1}^m d_i \|\hat{w}_i\|$ converges to a constant
whenever $\hat{w}$ is consistent, it does apply as well with minor modifications (in particular, to deal with
the case where $\mathbf{J}$ is empty, which requires $\mu_n = \infty$).
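The correspondence between the two regularization parameters can be seen from a one-line chain rule computation. As a short worked step (for a group with $w_j \neq 0$, and with the number of groups denoted $m$ as in the rest of the paper), compare the gradients of the two penalties with respect to $w_j$:

```latex
\nabla_{w_j} \Big[ \lambda_n \sum_{i=1}^m d_i \|w_i\| \Big]
  = \lambda_n \, d_j \, \frac{w_j}{\|w_j\|},
\qquad
\nabla_{w_j} \Big[ \tfrac{1}{2} \mu_n \Big( \sum_{i=1}^m d_i \|w_i\| \Big)^2 \Big]
  = \mu_n \Big( \sum_{i=1}^m d_i \|w_i\| \Big) \, d_j \, \frac{w_j}{\|w_j\|}.
```

The two gradients coincide at a given point exactly when $\lambda_n$ equals $\mu_n$ times the weighted sum of block norms at that point, so a minimizer of one problem satisfies the optimality conditions of the other for the matched parameter.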



3. Covariance Operators and Multiple Kernel Learning 

We now extend the previous consistency results to the case of non-parametric estimation, where each
group is a potentially infinite dimensional space of functions. Namely, the non parametric group
Lasso aims at estimating a sparse linear combination of functions of separate random variables, and
can then be seen as a variable selection method in a generalized additive model (Hastie and Tibshirani,
1990). Moreover, as shown in Section 3.5, the non-parametric group Lasso may also be seen as
equivalent to learning a convex combination of kernels, a framework referred to as multiple kernel
learning (MKL). In this context it is customary to have a single input space with several kernels (and
hence Hilbert spaces) defined on the same input space (Lanckriet et al., 2004b, Bach et al., 2004a).
Our framework accommodates this case as well, but our assumption regarding the invertibility
of the joint correlation operator states that the kernels cannot span Hilbert spaces which intersect.
In this nonparametric context, covariance operators constitute appropriate tools for the statistical
analysis and are becoming standard in the theoretical analysis of kernel methods (Fukumizu et al.,
2004, Gretton et al., 2005, Fukumizu et al., 2007, Caponnetto and de Vito, 2005). The following
section reviews important concepts. For more details, see Baker (1973) and Fukumizu et al. (2004).



3.1 Review of Covariance Operator Theory 

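In this section, we first consider a single set $\mathcal{X}$ and a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$,
associated with the reproducing kernel Hilbert space (RKHS) $\mathcal{F}$ of functions from $\mathcal{X}$ to $\mathbb{R}$ (see,
e.g., Schölkopf and Smola (2001) or Berlinet and Thomas-Agnan (2003) for an introduction to
RKHS theory). The Hilbert space and its dot product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$ are such that for all $x \in \mathcal{X}$,
$k(\cdot, x) \in \mathcal{F}$ and for all $f \in \mathcal{F}$, $\langle k(\cdot, x), f \rangle_{\mathcal{F}} = f(x)$, which leads to the reproducing property
$\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{F}} = k(x, y)$ for any $(x, y) \in \mathcal{X} \times \mathcal{X}$.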

Covariance operator and norms Given a random variable $X$ on $\mathcal{X}$ with bounded second order
moment, i.e., such that $E k(X, X) < \infty$, we can define the covariance operator as the bounded
linear operator $\Sigma_{XX}$ from $\mathcal{F}$ to $\mathcal{F}$ such that for all $(f, g) \in \mathcal{F} \times \mathcal{F}$,

$$\langle f, \Sigma_{XX} g \rangle_{\mathcal{F}} = \mathrm{cov}(f(X), g(X)) = E(f(X) g(X)) - (E f(X))(E g(X)).$$

The operator $\Sigma_{XX}$ is auto-adjoint, non-negative and Hilbert-Schmidt, i.e., for any orthonormal basis
$(e_p)_{p \geq 1}$ of $\mathcal{F}$, $\sum_{p=1}^\infty \|\Sigma_{XX} e_p\|_{\mathcal{F}}^2$ is finite; in this case, the value does not depend on the chosen
basis and is referred to as the square of the Hilbert-Schmidt norm. The norm that we use by default
in this paper is the operator norm $\|\Sigma_{XX}\|_{\mathcal{F}} = \sup_{f \in \mathcal{F},\ \|f\|_{\mathcal{F}} = 1} \|\Sigma_{XX} f\|_{\mathcal{F}}$, which is dominated by
the Hilbert-Schmidt norm. Note that in the finite dimensional case where $\mathcal{X} = \mathbb{R}^p$, $p > 0$ and the
kernel is linear, the covariance operator is exactly the covariance matrix, and the Hilbert-Schmidt
norm is the Frobenius norm, while the operator norm is the maximum singular value (also referred
to as the spectral norm).

The null space of the covariance operator is the space of functions $f \in \mathcal{F}$ such that $\mathrm{var} f(X) =
0$, i.e., such that $f$ is constant on the support of $X$.

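Empirical estimators Given data $x_i \in \mathcal{X}$, $i = 1, \dots, n$ sampled i.i.d. from $P_X$, the empirical
estimate $\hat{\Sigma}_{XX}$ of $\Sigma_{XX}$ is defined such that $\langle f, \hat{\Sigma}_{XX} g \rangle_{\mathcal{F}}$ is the empirical covariance between
$f(X)$ and $g(X)$, which leads to:

$$\hat{\Sigma}_{XX} = \frac{1}{n} \sum_{i=1}^n k(\cdot, x_i) \otimes k(\cdot, x_i) - \frac{1}{n} \sum_{i=1}^n k(\cdot, x_i) \otimes \frac{1}{n} \sum_{i=1}^n k(\cdot, x_i),$$

where $u \otimes v$ is the operator defined by $\langle f, (u \otimes v) g \rangle_{\mathcal{F}} = \langle f, u \rangle_{\mathcal{F}} \langle g, v \rangle_{\mathcal{F}}$. If we further assume that the
fourth order moment is finite, i.e., $E k(X, X)^2 < \infty$, then the estimate is uniformly consistent, i.e.,
$\|\hat{\Sigma}_{XX} - \Sigma_{XX}\|_{\mathcal{F}} = O_p(n^{-1/2})$ (see Fukumizu et al. (2007) and Appendix C.1), which generalizes
the usual result of finite dimension.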



3. A random variable $Z_n$ is said to be of order $O_p(a_n)$ if for any $\eta > 0$, there exists $M > 0$ such that $\sup_n P(|Z_n| > M a_n) < \eta$. See Van der Vaart (1998) for further definitions and properties of asymptotics in probability.
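For functions in the span of the data, the empirical covariance operator can be evaluated with the Gram matrix only. The sketch below is an illustration under stated assumptions: a one-dimensional Gaussian kernel with bandwidth parameter `b` (the helper `gaussian_gram` is hypothetical), and the centering matrix $\Pi_n = I_n - \frac{1}{n} 1_n 1_n^\top$. It checks that $\langle f, \hat{\Sigma}_{XX}\, g \rangle_{\mathcal{F}} = \frac{1}{n} a^\top K \Pi_n K c$ when $f = \sum_i a_i k(\cdot, x_i)$ and $g = \sum_i c_i k(\cdot, x_i)$.

```python
import numpy as np

def gaussian_gram(x, b=1.0):
    """Gram matrix K[a, c] = exp(-b * (x_a - x_c)^2) for 1-D inputs."""
    diff = x[:, None] - x[None, :]
    return np.exp(-b * diff ** 2)

rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)
K = gaussian_gram(x)

# f = sum_i a_i k(., x_i) and g = sum_i c_i k(., x_i) take the values
# K a and K c on the sample, by the reproducing property.
a, c = rng.standard_normal(n), rng.standard_normal(n)
f_vals, g_vals = K @ a, K @ c

# Centering matrix Pi_n = I - (1/n) 1 1^T.
Pi = np.eye(n) - np.ones((n, n)) / n

# <f, Sigma_hat g> expressed through the Gram matrix.
quad_form = a @ K @ Pi @ K @ c / n
# Direct empirical covariance between f(X) and g(X).
emp_cov = np.mean(f_vals * g_vals) - f_vals.mean() * g_vals.mean()
```

Both quantities agree up to floating point error, which is why kernel matrices suffice for all computations involving $\hat{\Sigma}_{XX}$.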
Cross-covariance and joint covariance operators Covariance operator theory can be extended to cases with more than one random variable (Baker, 1973). In our situation, we have $m$ input spaces $\mathcal{X}_1, \dots, \mathcal{X}_m$, $m$ random variables $X = (X_1, \dots, X_m)$ and $m$ RKHS $\mathcal{F}_1, \dots, \mathcal{F}_m$ associated with $m$ kernels $k_1, \dots, k_m$.

If we assume that $\mathbb{E}\, k_j(X_j, X_j) < \infty$ for all $j = 1, \dots, m$, then we can naturally define the cross-covariance operators $\Sigma_{X_i X_j}$ from $\mathcal{F}_j$ to $\mathcal{F}_i$ such that for all $(f_i, f_j) \in \mathcal{F}_i \times \mathcal{F}_j$,

$$\langle f_i, \Sigma_{X_i X_j} f_j \rangle_{\mathcal{F}_i} = \mathrm{cov}(f_i(X_i), f_j(X_j)) = \mathbb{E}(f_i(X_i) f_j(X_j)) - (\mathbb{E} f_i(X_i))(\mathbb{E} f_j(X_j)).$$

These are also Hilbert-Schmidt operators, and if we further assume that $\mathbb{E}\, k_j(X_j, X_j)^2 < \infty$ for all $j = 1, \dots, m$, then the natural empirical estimators converge to the population quantities in Hilbert-Schmidt and operator norms at rate $O_p(n^{-1/2})$. We can now define a joint block covariance operator on $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_m$ following the block structure of covariance matrices in Section 2. As in the finite dimensional case, it leads to a joint covariance operator $\Sigma_{XX}$ and we can refer to sub-blocks as $\Sigma_{X_I X_J}$ for the blocks indexed by $I$ and $J$.

Moreover, we can define the bounded (i.e., with finite operator norm) correlation operators $C_{X_i X_j}$ through $\Sigma_{X_i X_j} = \Sigma_{X_i X_i}^{1/2} C_{X_i X_j} \Sigma_{X_j X_j}^{1/2}$ (Baker, 1973). Throughout this paper we will make the assumption that those operators $C_{X_i X_j}$ are compact for $i \neq j$: compact operators can be characterized as limits of finite rank operators or as operators that can be diagonalized on a countable basis with spectrum composed of a sequence tending to zero (see, e.g., Brezis, 1980). This implies that the joint operator $C_{XX}$, naturally defined on $\mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_m$, is of the form "identity plus compact". It thus has a minimum and a maximum eigenvalue which are both between 0 and 1 (Brezis, 1980). If those eigenvalues are strictly greater than zero, then the operator is invertible, as are all the square sub-blocks. Moreover, the joint correlation operator is lower-bounded by a strictly positive constant times the identity operator.
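In finite dimension, the joint correlation operator is the block-normalized covariance matrix, with blocks $C_{X_i X_j} = \Sigma_{X_i X_i}^{-1/2} \Sigma_{X_i X_j} \Sigma_{X_j X_j}^{-1/2}$. The following sketch builds it from a covariance matrix and a list of groups; the helper names `inv_sqrt_psd` and `joint_correlation`, and the random covariance matrix, are assumptions made for illustration.

```python
import numpy as np

def inv_sqrt_psd(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs / np.sqrt(vals)) @ vecs.T

def joint_correlation(Sigma, groups):
    """Matrix C with blocks C_ij = Sigma_ii^{-1/2} Sigma_ij Sigma_jj^{-1/2}."""
    D = np.zeros_like(Sigma)
    for g in groups:
        D[np.ix_(g, g)] = inv_sqrt_psd(Sigma[np.ix_(g, g)])
    return D @ Sigma @ D

rng = np.random.default_rng(2)
G = rng.standard_normal((6, 12))
Sigma = G @ G.T                      # positive definite covariance matrix
groups = [[0, 1], [2, 3], [4, 5]]

C = joint_correlation(Sigma, groups)
eigs = np.linalg.eigvalsh(C)         # all strictly positive iff C is invertible
```

The diagonal blocks of `C` are identity matrices, and a strictly positive smallest eigenvalue corresponds to the invertibility requirement discussed above.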

Translation invariant kernels A particularly interesting ensemble of RKHS in the context of nonparametric estimation is the set of translation invariant kernels defined over $\mathcal{X} = \mathbb{R}^p$, where $p \geq 1$, of the form $k(x, x') = q(x' - x)$ where $q$ is a function on $\mathbb{R}^p$ with pointwise nonnegative integrable Fourier transform (which implies that $q$ is continuous). In this case, the associated RKHS is $\mathcal{F} = \{ q_{1/2} * g,\ g \in L^2(\mathbb{R}^p) \}$, where $q_{1/2}$ denotes the inverse Fourier transform of the square root of the Fourier transform of $q$, $*$ denotes the convolution operation, and $L^2(\mathbb{R}^p)$ denotes the space of square integrable functions. The norm is then equal to

$$\|f\|_{\mathcal{F}}^2 = \int_{\mathbb{R}^p} \frac{|F(\omega)|^2}{Q(\omega)}\, d\omega,$$

where $F$ and $Q$ are the Fourier transforms of $f$ and $q$ (Wahba, 1990, Schölkopf and Smola, 2001). Functions in the RKHS are functions with appropriately integrable derivatives. In this paper, when using infinite dimensional kernels, we use the Gaussian kernel $k(x, x') = q(x - x') = \exp(-b \|x - x'\|^2)$.



One-dimensional Hilbert spaces In this paper, we also consider real random variables $Y$ and $\varepsilon$ embedded in the natural Euclidean structure of real numbers (i.e., we consider the linear kernel on $\mathbb{R}$). In this setting the covariance operator $\Sigma_{X_j Y}$ from $\mathbb{R}$ to $\mathcal{F}_j$ can be canonically identified as an element of $\mathcal{F}_j$. Throughout this paper, we always use this identification.
3.2 Problem Formulation

We assume in this section and in the remainder of the paper that for each $j = 1, \dots, m$, $X_j \in \mathcal{X}_j$ where $\mathcal{X}_j$ is any set on which we have a reproducing kernel Hilbert space $\mathcal{F}_j$, associated with the positive kernel $k_j : \mathcal{X}_j \times \mathcal{X}_j \to \mathbb{R}$. We now make the following assumptions, which extend the assumptions (A1), (A2) and (A3). For each of them, we detail the main implications as well as common natural sufficient conditions. The first two conditions, (A4) and (A5), depend solely on the input variables, while the two other ones, (A6) and (A7), consider the relationship between $X$ and $Y$.



(A4) For each $j = 1, \dots, m$, $\mathcal{F}_j$ is a separable reproducing kernel Hilbert space associated with kernel $k_j$, and the random variables $k_j(\cdot, X_j)$ are not constant and have finite fourth-order moments, i.e., $\mathbb{E}\, k_j(X_j, X_j)^2 < \infty$.

This is a non restrictive assumption in many situations; for example, when (a) $\mathcal{X}_j = \mathbb{R}^{p_j}$ and the kernel function (such as the Gaussian kernel) is bounded, or when (b) $\mathcal{X}_j$ is a compact subset of $\mathbb{R}^{p_j}$ and the kernel is any continuous function such as linear or polynomial. This implies notably, as shown in Section 3.1, that we can define covariance, cross-covariance and correlation operators that are all Hilbert-Schmidt (Baker, 1973, Fukumizu et al., 2007) and can all be estimated at rate $O_p(n^{-1/2})$ in operator norm.

(A5) All cross-correlation operators are compact and the joint correlation operator $C_{XX}$ is invertible.

This is also a condition uniquely on the input spaces and not on $Y$. Following Fukumizu et al. (2007), a simple sufficient condition is that we have measurable spaces and distributions with joint density $p_X$ (and marginal distributions $p_{X_i}(x_i)$ and $p_{X_i X_j}(x_i, x_j)$) and that the mean square contingency between all pairs of variables is finite, i.e.,

$$\mathbb{E} \left\{ \frac{p_{X_i X_j}(x_i, x_j)}{p_{X_i}(x_i)\, p_{X_j}(x_j)} - 1 \right\} < \infty.$$

The contingency is a measure of statistical dependency (Rényi, 1959), and thus this sufficient condition simply states that two variables $X_i$ and $X_j$ cannot be too dependent. In the context of multiple kernel learning for heterogeneous data fusion, this corresponds to having sources which are heterogeneous enough. On top of compactness we impose the invertibility of the joint correlation operator; we use this assumption to make sure that the functions $\mathbf{f}_1, \dots, \mathbf{f}_m$ are unique. This ensures the non existence of any set of functions $f_1, \dots, f_m$ in the closures of $\mathcal{F}_1, \dots, \mathcal{F}_m$, such that $\mathrm{var}\, f_j(X_j) > 0$ and a linear combination is constant on the support of the random variables. In the context of generalized additive models, this assumption is referred to as the empty concurvity space assumption (Hastie and Tibshirani, 1990).



(A6) There exist functions $\mathbf{f} = (\mathbf{f}_1, \dots, \mathbf{f}_m) \in \mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_m$, $\mathbf{b} \in \mathbb{R}$, and a function $\mathbf{h}$ of $X = (X_1, \dots, X_m)$ such that $\mathbb{E}(Y | X) = \sum_{j=1}^{m} \mathbf{f}_j(X_j) + \mathbf{b} + \mathbf{h}(X)$ with $\mathbb{E}\, \mathbf{h}(X)^2 < \infty$, $\mathbb{E}\, \mathbf{h}(X) = 0$ and $\mathbb{E}\, \mathbf{h}(X) f_j(X_j) = 0$ for all $j = 1, \dots, m$ and $f_j \in \mathcal{F}_j$. We assume that $\mathbb{E}((Y - \mathbf{f}(X) - \mathbf{b})^2 | X)$ is almost surely greater than $\sigma_{\min}^2 > 0$ and smaller than $\sigma_{\max}^2 < \infty$. We let denote $\mathbf{J} = \{ j,\ \mathbf{f}_j \neq 0 \}$ the sparsity pattern of $\mathbf{f}$.

This assumption on the conditional expectation of $Y$ given $X$ is not the most general and follows common assumptions in approximation theory (see, e.g., Caponnetto and de Vito (2005), Cucker and Smale (2002) and references therein). It allows misspecification, but it essentially requires that the conditional expectation of $Y$ given sums of measurable functions of $X_j$ is attained at functions in the RKHS, and not merely measurable functions. Dealing with more general assumptions in the line of Ravikumar et al. (2008) requires considering consistency for norms weaker than the RKHS norms (Caponnetto and de Vito, 2005, Steinwart, 2001), and is left for future research. Note also that, to simplify proofs, we assume a finite upper bound $\sigma_{\max}^2$ on the residual variance.

(A7) For all $j \in \{1, \dots, m\}$, there exists $\mathbf{g}_j \in \mathcal{F}_j$ such that $\mathbf{f}_j = \Sigma_{X_j X_j}^{1/2} \mathbf{g}_j$, i.e., each $\mathbf{f}_j$ is in the range of $\Sigma_{X_j X_j}^{1/2}$.

This technical condition, already used by Caponnetto and de Vito (2005), which concerns all RKHS independently, ensures that we obtain consistency for the norm of the RKHS (and not another weaker norm) for the least-squares estimates. Note also that it implies that $\mathrm{var}\, \mathbf{f}_j(X_j) > 0$, i.e., $\mathbf{f}_j$ is not constant on the support of $X_j$.

This assumption might be checked (at least) in two ways; first, if $(e_p)_{p \geq 1}$ is a sequence of eigenfunctions of $\Sigma_{XX}$, associated with strictly positive eigenvalues $\lambda_p > 0$, then $f$ is in the range of $\Sigma_{XX}^{1/2}$ if and only if $f$ is constant outside the support of the random variable $X$ and $\sum_{p \geq 1} \frac{1}{\lambda_p} \langle f, e_p \rangle_{\mathcal{F}}^2$ is finite (i.e., the decay of the sequence $\langle f, e_p \rangle_{\mathcal{F}}^2$ is strictly faster than $\lambda_p$).

We also provide another sufficient condition that sheds additional light on this technical condition, which is always true for finite dimensional Hilbert spaces. For the common situation where $\mathcal{X}_j = \mathbb{R}^{p_j}$, $P_{X_j}$ (the marginal distribution of $X_j$) has a density $p_{X_j}(x_j)$ with respect to the Lebesgue measure and the kernel is of the form $k_j(x_j, x_j') = q_j(x_j - x_j')$, we have the following proposition (proved in Appendix D.4):



Proposition 9 Assume $\mathcal{X} = \mathbb{R}^p$ and $X$ is a random variable on $\mathcal{X}$ with distribution $P_X$ that has a strictly positive density $p_X(x)$ with respect to the Lebesgue measure. Assume $k(x, x') = q(x - x')$ for a function $q \in L^2(\mathbb{R}^p)$ that has an integrable pointwise positive Fourier transform, with associated RKHS $\mathcal{F}$. If $f$ can be written as $f = q * g$ (convolution of $q$ and $g$) with $\int_{\mathbb{R}^p} g(x)\, dx = 0$ and $\int_{\mathbb{R}^p} \frac{g(x)^2}{p_X(x)}\, dx < \infty$, then $f \in \mathcal{F}$ is in the range of the square root $\Sigma_{XX}^{1/2}$ of the covariance operator.

The previous proposition gives natural conditions regarding $f$ and $p_X$. Indeed, the condition $\int_{\mathbb{R}^p} \frac{g(x)^2}{p_X(x)}\, dx < \infty$ corresponds to a natural support condition, i.e., $f$ should be zero where $X$ has no mass, otherwise we will not be able to estimate $f$; note the similarity with the usual condition regarding the variance of importance sampling estimation (Brémaud, 1999). Moreover, $f$ should be even smoother than a regular function in the RKHS (convolution by $q$ instead of the square root of $q$). Finally, we provide in Appendix E detailed covariance structures for Gaussian kernels with Gaussian variables.

Notations Throughout this section, we refer to functions $f = (f_1, \dots, f_m) \in \mathcal{F} = \mathcal{F}_1 \times \cdots \times \mathcal{F}_m$ and the joint covariance operator $\Sigma_{XX}$. In the following, we always use the norms of the RKHS. When considering operators, we use the operator norm. We also refer to a subset of $f$ indexed by $J$ through $f_J$. Note that the Hilbert norm $\|f_J\|_{\mathcal{F}_J}$ is equal to $\|f_J\|_{\mathcal{F}_J} = \big( \sum_{j \in J} \|f_j\|_{\mathcal{F}_j}^2 \big)^{1/2}$. Finally, given a nonnegative auto-adjoint operator $S$, we let denote $S^{1/2}$ its nonnegative auto-adjoint square root (Baker, 1973).



3.3 Nonparametric Group Lasso 

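Given i.i.d. data $(x_{ij}, y_i)$, $i = 1, \dots, n$, $j = 1, \dots, m$, where each $x_{ij} \in \mathcal{X}_j$, our goal is to estimate consistently the functions $\mathbf{f}_j$ and which of them are zero. We let denote $Y \in \mathbb{R}^n$ the vector of responses. We consider the following optimization problem:

$$\min_{f \in \mathcal{F},\ b \in \mathbb{R}} \ \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{m} f_j(x_{ij}) - b \Big)^2 + \frac{\mu_n}{2} \Big( \sum_{j=1}^{m} d_j \|f_j\|_{\mathcal{F}_j} \Big)^2.$$

By minimizing with respect to $b$ in closed form, we obtain a similar formulation to Eq. (12), where empirical covariance matrices are replaced by empirical covariance operators:

$$\min_{f \in \mathcal{F}} \ \frac{1}{2} \hat{\Sigma}_{YY} - \langle f, \hat{\Sigma}_{XY} \rangle_{\mathcal{F}} + \frac{1}{2} \langle f, \hat{\Sigma}_{XX} f \rangle_{\mathcal{F}} + \frac{\mu_n}{2} \Big( \sum_{j=1}^{m} d_j \|f_j\|_{\mathcal{F}_j} \Big)^2. \qquad (15)$$

We let denote $\hat{f}$ any minimizer of Eq. (15), and we refer to it as the nonparametric group Lasso estimate, or also the multiple kernel learning estimate. By Proposition 13, the previous problem has indeed minimizers, and by Proposition 14 this global minimum is unique with probability tending to one.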
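The closed-form elimination of the intercept can be illustrated in the finite dimensional (linear kernel) case. The sketch below is written under assumptions: the penalty is taken to be the squared weighted block norm $\frac{\mu}{2} (\sum_j d_j \|w_j\|)^2$, covariances are normalized by $n$, and the function names and simulated data are hypothetical. It checks that the data-fit form evaluated at the optimal intercept equals the covariance form.

```python
import numpy as np

def block_penalty(w, groups, d, mu):
    """(mu / 2) * (sum_j d_j * ||w_j||)^2."""
    s = sum(dj * np.linalg.norm(w[g]) for g, dj in zip(groups, d))
    return 0.5 * mu * s ** 2

def objective_samples(w, b, X, y, groups, d, mu):
    """Data-fit form of the objective, with an explicit intercept b."""
    r = y - X @ w - b
    return 0.5 * np.mean(r ** 2) + block_penalty(w, groups, d, mu)

def objective_covariances(w, X, y, groups, d, mu):
    """Covariance form obtained after minimising over b in closed form."""
    n = len(y)
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    S_yy, S_xy, S_xx = yc @ yc / n, Xc.T @ yc / n, Xc.T @ Xc / n
    fit = 0.5 * S_yy - w @ S_xy + 0.5 * w @ S_xx @ w
    return fit + block_penalty(w, groups, d, mu)

rng = np.random.default_rng(3)
n, groups = 100, [[0, 1], [2, 3]]
X = rng.standard_normal((n, 4))
y = X @ np.array([1.0, -1.0, 0.0, 0.0]) + 0.5 + 0.1 * rng.standard_normal(n)
w = rng.standard_normal(4)
d, mu = [1.0, 1.0], 0.1

b_star = y.mean() - X.mean(axis=0) @ w   # closed-form minimiser over b
val_samples = objective_samples(w, b_star, X, y, groups, d, mu)
val_cov = objective_covariances(w, X, y, groups, d, mu)
```

Any other intercept gives a larger value of the data-fit form, so only centered quantities (covariances) matter for the estimate of $w$.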

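Note that formally, the finite and infinite dimensional formulations in Eq. (12) and Eq. (15) are the same, and this is the main reason why covariance operators are very practical tools for the analysis. Furthermore, we have the corresponding proposition regarding optimality conditions (see proof in Appendix A.3):

Proposition 10 A function $f \in \mathcal{F}$ with sparsity pattern $J = J(f) = \{ j,\ f_j \neq 0 \}$ is optimal for problem (15) if and only if

$$\forall j \in J^c, \quad \big\| \hat{\Sigma}_{X_j X} f - \hat{\Sigma}_{X_j Y} \big\|_{\mathcal{F}_j} \leq \mu_n d_j \Big( \sum_{i=1}^{m} d_i \|f_i\|_{\mathcal{F}_i} \Big), \qquad (16)$$

$$\forall j \in J, \quad \hat{\Sigma}_{X_j X} f - \hat{\Sigma}_{X_j Y} = - \mu_n \Big( \sum_{i=1}^{m} d_i \|f_i\|_{\mathcal{F}_i} \Big) \frac{d_j f_j}{\|f_j\|_{\mathcal{F}_j}}. \qquad (17)$$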



A consequence (and in fact the first part of the proof) is that an optimal function $f$ must be in the range of $\hat{\Sigma}_{XY}$ and $\hat{\Sigma}_{XX}$, i.e., an optimal $f$ is supported by the data; that is, each $f_j$ is a linear combination of functions $k_j(\cdot, x_{ij})$, $i = 1, \dots, n$. This is a rather convoluted way of presenting the representer theorem (Wahba, 1990), but this is the easiest for the theoretical analysis of consistency. However, to actually compute the estimate $\hat{f}$ from data, we need the usual formulation with dual parameters (see Section 3.5).

Moreover, one important conclusion is that all our optimization problems in spaces of functions can in fact be transcribed into finite-dimensional problems. In particular, all notions from multivariate differentiable calculus may be used without particular care regarding the infinite dimension.



3.4 Consistency Results 

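We consider the following strict and weak conditions, which correspond to conditions (4) and (5) in the finite dimensional case:

$$\max_{i \in \mathbf{J}^c} \frac{1}{d_i} \Big\| \Sigma_{X_i X_i}^{1/2} C_{X_i X_{\mathbf{J}}} C_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{f}_j\|_{\mathcal{F}_j})\, \mathbf{g}_{\mathbf{J}} \Big\|_{\mathcal{F}_i} < 1, \qquad (18)$$

$$\max_{i \in \mathbf{J}^c} \frac{1}{d_i} \Big\| \Sigma_{X_i X_i}^{1/2} C_{X_i X_{\mathbf{J}}} C_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{f}_j\|_{\mathcal{F}_j})\, \mathbf{g}_{\mathbf{J}} \Big\|_{\mathcal{F}_i} \leq 1, \qquad (19)$$

where $\mathrm{Diag}(d_j / \|\mathbf{f}_j\|_{\mathcal{F}_j})$ denotes the block-diagonal operator with operators $\frac{d_j}{\|\mathbf{f}_j\|_{\mathcal{F}_j}} I_{\mathcal{F}_j}$ on the diagonal. Note that this is well-defined because $C_{XX}$ is invertible and that it reduces to Eq. (4) and Eq. (5) when the input spaces $\mathcal{X}_j$, $j = 1, \dots, m$, are of the form $\mathbb{R}^{p_j}$ and the kernels are linear. The main reason for rewriting the conditions in terms of correlation operators rather than covariance operators is that correlation operators are invertible by assumption, while covariance operators are not as soon as the Hilbert spaces have infinite dimensions. The following theorems give necessary and sufficient conditions for the path consistency of the nonparametric group Lasso (see proofs in Appendix C.2 and Appendix C.3):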

Theorem 11 Assume (A4-7) and that $\mathbf{J}$ is not empty. If condition (18) is satisfied, then for any sequence $\mu_n$ such that $\mu_n \to 0$ and $\mu_n n^{1/2} \to +\infty$, any sequence of nonparametric group Lasso estimates $\hat{f}$ converges in probability to $\mathbf{f}$ and the sparsity pattern $J(\hat{f}) = \{ j,\ \hat{f}_j \neq 0 \}$ converges in probability to $\mathbf{J}$.

Theorem 12 Assume (A4-7) and that $\mathbf{J}$ is not empty. If there exists a (possibly data-dependent) sequence $\mu_n$ such that $\hat{f}$ converges to $\mathbf{f}$ and $\hat{J}$ converges to $\mathbf{J}$ in probability, then condition (19) is satisfied.

Essentially, the results in finite dimension also hold when groups have infinite dimensions. We leave the extensions of the refined results in Section 2.4 to future work. Condition (18) might be hard to check in practice since it involves inversion of correlation operators; see Section 3.6 for an estimate from data.
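With linear kernels, the left-hand side of the condition is a plain matrix computation on the population covariance. The sketch below assumes the finite dimensional form $\max_{i \in \mathbf{J}^c} \frac{1}{d_i} \| \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{w}_j\|) \mathbf{w}_{\mathbf{J}} \|$; the function name `consistency_statistic` and the two toy covariance matrices are hypothetical. A value strictly below one corresponds to the strict condition, a value at most one to the weak condition.

```python
import numpy as np

def consistency_statistic(Sigma, w, groups, d):
    """max over inactive groups i of
    (1/d_i) * ||Sigma_{iJ} Sigma_{JJ}^{-1} Diag(d_j/||w_j||) w_J||."""
    norms = [np.linalg.norm(w[g]) for g in groups]
    active = [j for j, nj in enumerate(norms) if nj > 0]
    inactive = [j for j, nj in enumerate(norms) if nj == 0]
    J = np.concatenate([groups[j] for j in active])
    # Diag(d_j / ||w_j||) w_J: each active block is rescaled to norm d_j.
    v = np.concatenate([d[j] * w[groups[j]] / norms[j] for j in active])
    u = np.linalg.solve(Sigma[np.ix_(J, J)], v)
    return max(
        np.linalg.norm(Sigma[np.ix_(groups[i], J)] @ u) / d[i] for i in inactive
    )

# Uncorrelated groups: the statistic is zero, so the strict condition holds.
w = np.array([1.0, 2.0, 0.0, 0.0, 0.0, 0.0])
stat_id = consistency_statistic(np.eye(6), w, [[0, 1], [2, 3], [4, 5]], [1.0, 1.0, 1.0])

# Two scalar variables with correlation rho: the statistic equals |rho|.
rho = 0.5
Sigma2 = np.array([[1.0, rho], [rho, 1.0]])
stat_rho = consistency_statistic(Sigma2, np.array([1.0, 0.0]), [[0], [1]], [1.0, 1.0])
```

The function expects at least one inactive group; when every group is active there is nothing to check.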



3.5 Multiple Kernel Learning Formulation 

Proposition 10 does not readily lead to an algorithm for computing the estimate $\hat{f}$. In this section, following Bach et al. (2004a), we link the group Lasso to the multiple kernel learning framework (Lanckriet et al., 2004b). Problem (15) is an optimization problem on a potentially infinite dimensional space of functions. However, the following proposition shows that it reduces to a finite dimensional problem that we now make precise (see proof in Appendix A.4):

Proposition 13 The dual of problem (15) is

$$\max_{\alpha \in \mathbb{R}^n,\ \alpha^\top 1_n = 0} \left\{ - \frac{1}{2n} \| Y - n \mu_n \alpha \|^2 - \frac{1}{2 \mu_n} \max_{i = 1, \dots, m} \frac{\alpha^\top K_i \alpha}{d_i^2} \right\}, \qquad (20)$$

where $(K_i)_{ab} = k_i(x_a, x_b)$ are the kernel matrices in $\mathbb{R}^{n \times n}$, for $i = 1, \dots, m$. Moreover, the dual variable $\alpha \in \mathbb{R}^n$ is optimal if and only if $\alpha^\top 1_n = 0$ and there exists $\eta \in \mathbb{R}_+^m$ such that $\sum_{j=1}^{m} \eta_j d_j^2 = 1$,

$$\Big( \sum_{j=1}^{m} \eta_j K_j + n \mu_n I_n \Big) \alpha = Y, \qquad (21)$$

$$\forall j \in \{1, \dots, m\}, \quad \frac{\alpha^\top K_j \alpha}{d_j^2} < \max_{i = 1, \dots, m} \frac{\alpha^\top K_i \alpha}{d_i^2} \ \Rightarrow\ \eta_j = 0. \qquad (22)$$

The optimal function may then be written as $f_j = \eta_j \sum_{i=1}^{n} \alpha_i k_j(\cdot, x_{ij})$.
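Since the problem in Eq. (20) is strictly convex, there is a unique dual solution $\alpha$. Note that Eq. (21) corresponds to the optimality conditions for the least-square problem:

$$\min_{f \in \mathcal{F}} \ \frac{1}{2} \hat{\Sigma}_{YY} - \langle f, \hat{\Sigma}_{XY} \rangle_{\mathcal{F}} + \frac{1}{2} \langle f, \hat{\Sigma}_{XX} f \rangle_{\mathcal{F}} + \frac{\mu_n}{2} \sum_{j,\ \eta_j > 0} \frac{\|f_j\|_{\mathcal{F}_j}^2}{\eta_j},$$

whose dual problem is:

$$\max_{\alpha \in \mathbb{R}^n,\ \alpha^\top 1_n = 0} \left\{ - \frac{1}{2n} \| Y - n \mu_n \alpha \|^2 - \frac{1}{2 \mu_n} \alpha^\top \Big( \sum_{j=1}^{m} \eta_j K_j \Big) \alpha \right\},$$

and unique solution is $\alpha = \big( \sum_{j=1}^{m} \eta_j K_j + n \mu_n I_n \big)^{-1} Y$. That is, the solution of the MKL problem leads to dual parameters $\alpha$ and a set of weights $\eta \geq 0$ such that $\alpha$ is the solution to the least-square problem with kernel $K = \sum_{j=1}^{m} \eta_j K_j$. Bach et al. (2004a) have shown in a very similar context (hinge loss instead of the square loss) that the optimal $\eta$ in Proposition 13 can be obtained as the minimizer of the optimal value of the regularized least-square problem with kernel matrix $\sum_{j=1}^{m} \eta_j K_j$, i.e.:

$$J(\eta) = \max_{\alpha \in \mathbb{R}^n,\ \alpha^\top 1_n = 0} \left\{ - \frac{1}{2n} \| Y - n \mu_n \alpha \|^2 - \frac{1}{2 \mu_n} \alpha^\top \Big( \sum_{j=1}^{m} \eta_j K_j \Big) \alpha \right\},$$

with respect to $\eta \geq 0$ such that $\sum_{j=1}^{m} \eta_j d_j^2 = 1$. This formulation allows one to derive probably approximately correct error bounds (Lanckriet et al., 2004b, Bousquet and Herrmann, 2003). Besides, this formulation allows $\eta$ to be negative, as long as the matrix $\sum_{j=1}^{m} \eta_j K_j$ is positive semi-definite. However, theoretical advantages of such a possibility still remain unclear.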
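For fixed weights $\eta$, the linear system in Eq. (21) is an ordinary kernel ridge regression with the combined kernel $\sum_j \eta_j K_j$. The following sketch solves it with NumPy and recovers the component functions on the training points. It is an illustration under assumptions: the intercept and the constraint $\alpha^\top 1_n = 0$ are ignored, the weights `eta` are fixed by hand rather than optimized, and the helper names and simulated data are hypothetical.

```python
import numpy as np

def solve_alpha(kernels, eta, y, mu):
    """Solve (sum_j eta_j K_j + n mu I) alpha = y for fixed weights eta."""
    n = len(y)
    K = sum(e * Kj for e, Kj in zip(eta, kernels))
    return np.linalg.solve(K + n * mu * np.eye(n), y)

def predict_component(Kj, eta_j, alpha):
    """Values of f_j = eta_j * sum_i alpha_i k_j(., x_ij) at the training points."""
    return eta_j * (Kj @ alpha)

rng = np.random.default_rng(4)
n = 50
x = rng.standard_normal((n, 2))
# One Gaussian kernel per input variable.
kernels = [np.exp(-(x[:, [j]] - x[:, [j]].T) ** 2) for j in range(2)]
y = np.sin(x[:, 0]) + 0.1 * rng.standard_normal(n)

eta, mu = np.array([0.7, 0.3]), 0.05   # with d_j = 1, sum_j eta_j d_j^2 = 1
alpha = solve_alpha(kernels, eta, y, mu)
fitted = sum(predict_component(Kj, e, alpha) for Kj, e in zip(kernels, eta))
```

By construction the residual `y - fitted` equals `n * mu * alpha`, which is a convenient sanity check on any implementation of Eq. (21).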

Finally, we state a corollary of Proposition 13 that shows that under our assumptions regarding the correlation operator, we have a unique solution to the nonparametric group Lasso problem with probability tending to one (see proof in Appendix A.5):

Proposition 14 Assume (A4-5). The problem (15) has a unique solution with probability tending to one.



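3.6 Estimation of Correlation Condition (18)

Condition (4) is simple to compute while the nonparametric condition (18) might be hard to check even if all densities are known (we provide however in Section 5 a specific example where we can compute in closed form all covariance operators). The following proposition shows that we can consistently estimate the quantities $\big\| \Sigma_{X_i X_i}^{1/2} C_{X_i X_{\mathbf{J}}} C_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{f}_j\|_{\mathcal{F}_j})\, \mathbf{g}_{\mathbf{J}} \big\|_{\mathcal{F}_i}$ given an i.i.d. sample (see proof in Appendix C.4):

Proposition 15 Assume (A4-7), and $\kappa_n \to 0$ and $\kappa_n n^{1/2} \to \infty$. Let

$$\alpha = \Pi_n \Big( \sum_{j \in \mathbf{J}} \Pi_n K_j \Pi_n + n \kappa_n I_n \Big)^{-1} \Pi_n Y$$

and $\hat{\eta}_j = \frac{1}{d_j} (\alpha^\top K_j \alpha)^{1/2}$. Then, for all $i \in \mathbf{J}^c$, the norm $\big\| \Sigma_{X_i X_i}^{1/2} C_{X_i X_{\mathbf{J}}} C_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{f}_j\|_{\mathcal{F}_j})\, \mathbf{g}_{\mathbf{J}} \big\|_{\mathcal{F}_i}$ is consistently estimated by:

$$\Big\| (\Pi_n K_i \Pi_n)^{1/2} \Big( \sum_{j \in \mathbf{J}} \Pi_n K_j \Pi_n + n \kappa_n I_n \Big)^{-1} \Big( \sum_{j \in \mathbf{J}} \frac{1}{\hat{\eta}_j} \Pi_n K_j \Pi_n \Big) \alpha \Big\|. \qquad (23)$$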



4. Adaptive Group Lasso and Multiple Kernel Learning 

In previous sections, we have shown that specific necessary and sufficient conditions are needed for path consistency of the group Lasso and multiple kernel learning. The following procedures, adapted from the adaptive Lasso of Zou (2006), are two-step procedures that always achieve both consistencies, with no condition such as Eq. (4) or Eq. (18). As before, results are a bit different when groups have finite sizes and when groups may have infinite sizes.



4.1 Adaptive Group Lasso 



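The following theorem extends the similar theorem of Zou (2006), and shows that we can get both $O_p(n^{-1/2})$ consistency and correct pattern estimation:

Theorem 16 Assume (A1-3) and $\gamma > 0$. Let $\hat{w}^{LS} = \hat{\Sigma}_{XX}^{-1} \hat{\Sigma}_{XY}$ denote the (unregularized) least-square estimate. Let $\hat{w}^A$ denote any minimizer of

$$\frac{1}{2} \hat{\Sigma}_{YY} - \hat{\Sigma}_{YX} w + \frac{1}{2} w^\top \hat{\Sigma}_{XX} w + \frac{\mu_n}{2} \Big( \sum_{j=1}^{m} \|\hat{w}_j^{LS}\|^{-\gamma} \|w_j\| \Big)^2.$$

If $n^{-1/2} \gg \mu_n \gg n^{-1/2 - \gamma/2}$, then $\hat{w}^A$ converges in probability to $\mathbf{w}$, $J(\hat{w}^A)$ converges in probability to $\mathbf{J}$, and $n^{1/2} (\hat{w}^A_{\mathbf{J}} - \mathbf{w}_{\mathbf{J}})$ tends in distribution to a normal distribution with mean zero and covariance matrix $\sigma^2 \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1}$.

This theorem, proved in Appendix D.1, shows that the adaptive group Lasso exhibits all important asymptotic properties, both in terms of errors and selected models. In the nonparametric case, we obtain a weaker result.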
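The first step of the adaptive procedure can be sketched as follows in the finite dimensional case. This is an illustration under assumptions: the weights are taken to be $d_j = \|\hat{w}_j^{LS}\|^{-\gamma}$ in the spirit of the adaptive Lasso of Zou (2006), the data are simulated, and the function name `adaptive_weights` is hypothetical.

```python
import numpy as np

def adaptive_weights(X, y, groups, gamma=1.0):
    """Weights d_j = ||w_LS_j||^(-gamma) from the unregularised least-squares fit."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    # Unregularised least-squares estimate Sigma_XX^{-1} Sigma_XY.
    w_ls = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
    d = np.array([np.linalg.norm(w_ls[g]) ** (-gamma) for g in groups])
    return d, w_ls

rng = np.random.default_rng(5)
n, groups = 2000, [[0, 1], [2, 3]]
w_true = np.array([1.0, -1.0, 0.0, 0.0])   # second group is inactive
X = rng.standard_normal((n, 4))
y = X @ w_true + 0.1 * rng.standard_normal(n)

d, w_ls = adaptive_weights(X, y, groups, gamma=1.0)
```

Groups whose least-squares loadings are close to zero receive a large weight and are therefore penalized much more heavily in the second step, which is the mechanism behind pattern consistency.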



4.2 Adaptive Multiple Kernel Learning 

We begin with the consistency of the least-square estimate (see proof in Appendix D.2):

Proposition 17 Assume (A4-7). The unique minimizer $\hat{f}^{LS}_{\kappa_n}$ of

$$\frac{1}{2} \hat{\Sigma}_{YY} - \langle \hat{\Sigma}_{XY}, f \rangle_{\mathcal{F}} + \frac{1}{2} \langle f, \hat{\Sigma}_{XX} f \rangle_{\mathcal{F}} + \frac{\kappa_n}{2} \sum_{j=1}^{m} \|f_j\|_{\mathcal{F}_j}^2$$

converges in probability to $\mathbf{f}$ if $\kappa_n \to 0$ and $\kappa_n n^{1/2} \to \infty$. Moreover, we have $\|\hat{f}^{LS}_{\kappa_n} - \mathbf{f}\|_{\mathcal{F}} = O_p(\kappa_n^{1/2} + \kappa_n^{-1} n^{-1/2})$.

Since the least-square estimate is consistent and we have an upper bound on its convergence rate, we follow Zou (2006) and use it to define adaptive weights $d_j$ for which we get both sparsity and regular consistency without any conditions on the value of the correlation operators.

Theorem 18 Assume (A4-7) and $\gamma > 1$. Let $\hat{f}^{LS}_{n^{-1/3}}$ be the least-square estimate with regularization parameter proportional to $n^{-1/3}$, as defined in Proposition 17. Let $\hat{f}^A$ denote any minimizer of

Then $\hat{f}^A$ converges to $\mathbf{f}$ and $J(\hat{f}^A)$ converges to $\mathbf{J}$ in probability.

Theorem 18 allows us to set up a specific vector of weights $d$. This provides a principled way to define data adaptive weights, which solves (at least theoretically) the potential consistency problems of the usual MKL framework (see Section 5 for illustration on synthetic examples). Note that we have no result concerning the $O_p(n^{-1/2})$ consistency of our procedure (as we have for the finite dimensional case) and obtaining precise convergence rates is the subject of ongoing research.

The following proposition gives the expression for the solution of the least-square problem, necessary for the computation of adaptive weights in Theorem 18.

Proposition 19 The solution of the least-square problem in Proposition 17 is given by

$$\forall j \in \{1, \dots, m\}, \quad \hat{f}_j^{LS} = \sum_{i=1}^{n} \alpha_i k_j(\cdot, x_{ij}) \quad \text{with} \quad \alpha = \Pi_n \Big( \sum_{j=1}^{m} \Pi_n K_j \Pi_n + n \kappa_n I_n \Big)^{-1} \Pi_n Y,$$

with norms $\|\hat{f}_j^{LS}\|_{\mathcal{F}_j} = (\alpha^\top K_j \alpha)^{1/2}$, $j = 1, \dots, m$.

Other weighting schemes have been suggested, based on various heuristics. A notable one (which we use in simulations) is the normalization of kernel matrices by their trace (Lanckriet et al., 2004b), which leads to $d_j = (\mathrm{tr}\, \hat{\Sigma}_{X_j X_j})^{1/2} = (\frac{1}{n} \mathrm{tr}\, \Pi_n K_j \Pi_n)^{1/2}$. Bach et al. (2004b) have observed empirically that such normalization might lead to suboptimal solutions and consider weights $d_j$ that grow with the empirical ranks of the kernel matrices. In this paper, we give theoretical arguments that indicate that weights which do depend on the data are more appropriate and work better (see Section 5 for examples).
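The two weighting schemes discussed above can be computed from kernel matrices alone. The sketch below assumes the centering matrix $\Pi_n = I_n - \frac{1}{n} 1_n 1_n^\top$, Gaussian kernels on simulated one-dimensional inputs, and illustrative values of `kappa` and of the exponent used for the adaptive weights; the function names `least_squares_mkl` and `trace_weights` are hypothetical.

```python
import numpy as np

def least_squares_mkl(kernels, y, kappa):
    """alpha = Pi (sum_j Pi K_j Pi + n kappa I)^{-1} Pi y and norms (alpha^T K_j alpha)^{1/2}."""
    n = len(y)
    Pi = np.eye(n) - np.ones((n, n)) / n
    K_sum = sum(Pi @ K @ Pi for K in kernels)
    alpha = Pi @ np.linalg.solve(K_sum + n * kappa * np.eye(n), Pi @ y)
    norms = np.array([np.sqrt(max(alpha @ K @ alpha, 0.0)) for K in kernels])
    return alpha, norms

def trace_weights(kernels):
    """Unit trace weighting: d_j = ((1/n) tr(Pi K_j Pi))^{1/2}."""
    n = kernels[0].shape[0]
    Pi = np.eye(n) - np.ones((n, n)) / n
    return np.array([np.sqrt(np.trace(Pi @ K @ Pi) / n) for K in kernels])

rng = np.random.default_rng(7)
n = 100
x = rng.standard_normal((n, 3))
kernels = [np.exp(-(x[:, [j]] - x[:, [j]].T) ** 2) for j in range(3)]
y = np.sin(2 * x[:, 0]) + 0.1 * rng.standard_normal(n)

alpha, norms = least_squares_mkl(kernels, y, kappa=0.1)  # kappa: illustrative value
d_adaptive = norms ** (-2.0)                             # exponent: illustrative value
d_trace = trace_weights(kernels)
```

The adaptive weights depend on the responses through `alpha`, whereas the trace weights depend only on the inputs; this is the distinction the paragraph above draws.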



5. Simulations 

In this section, we illustrate the consistency results obtained in this paper with a few simple simula- 
tions on synthetic examples. 

5.1 Groups of Finite Sizes 

In the finite dimensional group case, we sampled $X \in \mathbb{R}^p$ from a normal distribution with zero mean vector and a covariance matrix of size $p = 8$ for $m = 4$ groups of size $p_j = 2$, $j = 1, \dots, m$, generated as follows: (a) sample a $p \times p$ matrix $G$ with independent standard normal distributions, (b) form $\Sigma_{XX} = G G^\top$, (c) for each $j \in \{1, \dots, m\}$, rescale $X_j \in \mathbb{R}^2$ so that $\mathrm{tr}\, \Sigma_{X_j X_j} = 1$.

Figure 1: Regularization paths for the group Lasso for two weighting schemes (left: non adaptive, right: adaptive) and three different population densities (top: strict consistency condition satisfied, middle: weak condition not satisfied, no model consistent estimates, bottom: weak condition not satisfied, some model consistent estimates but without regular consistency). For each of the plots, plain curves correspond to values of estimated $\hat{\eta}_j$, dotted curves to population values $\eta_j$, and bold curves to model consistent estimates.

We selected $\mathrm{Card}(\mathbf{J}) = 2$ groups at random and sampled non zero loading vectors as follows: (a) sample each loading from independent standard normal distributions, (b) rescale those to unit
norm, (c) rescale those by a scaling which is uniform at random between ^ and 1. Finally, we chose 
a constant noise level of standard deviation $\sigma$ equal to 0.2 times $(\mathbb{E} (\mathbf{w}^\top X)^2)^{1/2}$ and sampled $Y$ from a conditional normal distribution with constant variance. The joint distribution on $(X, Y)$ thus defined satisfies with probability one assumptions (A1-3).
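The sampling procedure described above can be sketched as follows. This is a hedged illustration, not the code used for the experiments: the lower bound of the uniform rescaling of the loadings is exposed as a hypothetical parameter `scale_low` (its default of 0.5 is an arbitrary assumption), and all function names are illustrative.

```python
import numpy as np

def sample_design(rng, m=4, pj=2):
    """Covariance Sigma = G G^T, rescaled so that tr(Sigma_jj) = 1 for each group."""
    p = m * pj
    G = rng.standard_normal((p, p))
    Sigma = G @ G.T
    groups = [list(range(j * pj, (j + 1) * pj)) for j in range(m)]
    scale = np.ones(p)
    for g in groups:
        scale[g] = 1.0 / np.sqrt(np.trace(Sigma[np.ix_(g, g)]))
    return Sigma * np.outer(scale, scale), groups

def sample_loadings(rng, groups, card_J=2, scale_low=0.5):
    """Unit-norm Gaussian loadings on card_J random groups, rescaled uniformly
    in [scale_low, 1]; scale_low is an assumed value."""
    w = np.zeros(sum(len(g) for g in groups))
    active = rng.choice(len(groups), size=card_J, replace=False)
    for j in active:
        v = rng.standard_normal(len(groups[j]))
        w[groups[j]] = rng.uniform(scale_low, 1.0) * v / np.linalg.norm(v)
    return w, sorted(active.tolist())

def sample_data(rng, Sigma, w, n, noise_ratio=0.2):
    """X ~ N(0, Sigma) and Y = w^T X + noise with std 0.2 * (E (w^T X)^2)^{1/2}."""
    X = rng.multivariate_normal(np.zeros(len(w)), Sigma, size=n)
    sigma = noise_ratio * np.sqrt(w @ Sigma @ w)
    return X, X @ w + sigma * rng.standard_normal(n)

rng = np.random.default_rng(6)
Sigma, groups = sample_design(rng)
w, active = sample_loadings(rng, groups)
X, y = sample_data(rng, Sigma, w, n=200)
```

Rescaling by `np.outer(scale, scale)` corresponds to rescaling the variables $X_j$ themselves, so the cross-covariance blocks are adjusted consistently with the diagonal blocks.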

For cases when the correlation conditions (4) and (5) were or were not satisfied, we consider two different weighting schemes, i.e., different ways of setting the weights $d_j$ of the block $\ell_1$-norm: unit weights (which correspond to the unit trace weighting scheme) and adaptive weights as defined in Section 4.

In Figure 1, we plot the regularization paths corresponding to 200 i.i.d. samples, computed by the algorithm of Bach et al. (2004b). We only plot the values of the estimated variables $\hat{\eta}_j$, $j = 1, \dots, m$, for the alternative formulation in Section 2.7, which are proportional to $\|\hat{w}_j\|$ and normalized so that $\sum_{j=1}^{m} \hat{\eta}_j = 1$. We compare them to the population values $\eta_j$, both in terms of values and in terms of their sparsity pattern ($\eta_j$ is zero for the weights which are equal to zero). Figure 1 illustrates several of our theoretical results: (a) the top row corresponds to a situation where the strict consistency condition is satisfied and thus we obtain model consistent estimates with also a good estimation of the loading vectors (in the figure, only the good behavior of the norms of these loading vectors is represented); (b) the right column corresponds to the adaptive weighting schemes, which also always achieve the two types of consistency; (c) in the middle and bottom rows, the consistency condition was not satisfied, and in the bottom row the condition of Proposition 5 that ensures that we can get model consistent estimates without regular consistency is met, while it is not in the middle row: as expected, in the bottom row, we get some model consistent estimates but with bad norm estimation.

In Figures 2, 3 and 4, we consider the three joint distributions used in Figure 1 and compute
regularization paths for several number of samples (10 to 10^) with 200 replications. This allows 
us to estimate both the probability of correct pattern estimation $P(J(\hat{w}) = \mathbf{J})$, which is considered in Section 2.3, and the logarithm of the expected error $\log \mathbb{E} \|\hat{w} - \mathbf{w}\|^2$.

From Figure 2, it is worth noting (a) the regular spacing between the probability of correct pattern selection for several equally spaced (in log scale) numbers of samples, which corroborates the asymptotic result in Section 2.3. Moreover, (b) in both rows, we get model consistent estimates with increasingly smaller norms as the number of samples grows. Finally, (c) the mean square errors are smaller for the adaptive weighting scheme.

From Figure 3, it is worth noting that (a) in the non adaptive case, we have two regimes for the probability of correct pattern selection: a regime corresponding to Proposition 6 where this probability can take values in $[0, 1)$ for increasingly smaller regularization parameters (when $n$ grows); and a regime corresponding to non vanishing limiting regularization parameters, corresponding to Proposition 5: we have model consistency without regular consistency. Also, (b) the adaptive weighting scheme allows both consistencies. In Figure 3, however, the second regime (correct model estimates, inconsistent estimation of loadings) is not present.

In Figure 5, we sampled 10,000 different covariance matrices and loading vectors using the procedure described above. For each of these we computed the regularization paths from 1000 samples, and we classify each path into three categories: (1) existence of model consistent estimates
with estimation error — w|| less than 10^^, (2) existence of model consistent estimates but none 
Figure 2: Synthetic example where consistency condition in Eq. (4) is satisfied (same example as the top of Figure 1): probability of correct pattern selection (left) and logarithm of the expected mean squared estimation error (right), for several numbers of samples as a function of the regularization parameter, for regular regularization (top), adaptive regularization with $\gamma = 1$ (bottom).

Figure 3: Synthetic example where consistency condition in Eq. (4) is not satisfied (same example as the middle of Figure 1): probability of correct pattern selection (left) and logarithm of the expected mean squared estimation error (right), for several numbers of samples as a function of the regularization parameter, for regular regularization (top), adaptive regularization with $\gamma = 1$ (bottom).



[Figure 4 panel titles: "inconsistent - non adaptive" (top row), "inconsistent - adaptive ($\gamma = 1$)" (bottom row).]

Figure 4: Synthetic example where consistency condition in Eq. (4) is not satisfied (same example as the bottom of Figure 1): probability of correct pattern selection (left) and logarithm of the expected mean squared estimation error (right), for several numbers of samples as a function of the regularization parameter, for regular regularization (top), adaptive regularization with $\gamma = 1$ (bottom).

Figure 5: Consistency of estimation vs. consistency condition. See text for details.



with estimation error $\|\hat{w} - \mathbf{w}\|$ less than 10 and (3) non existence of model consistent estimates.

In Figure 5, we plot the proportion of each of the three classes as a function of the logarithm of $\max_{i \in \mathbf{J}^c} \frac{1}{d_i} \big\| \Sigma_{X_i X_{\mathbf{J}}} \Sigma_{X_{\mathbf{J}} X_{\mathbf{J}}}^{-1} \mathrm{Diag}(d_j / \|\mathbf{w}_j\|)\, \mathbf{w}_{\mathbf{J}} \big\|$. The position of the previous value with respect to 1 is indicative of the expected model consistency. When it is less than one, then we get with overwhelming probability model consistent estimates with good errors. As the condition gets larger than one, we get fewer such good estimates and more and more model inconsistent estimates.

5.2 Nonparametric Case 

In the infinite dimensional group case, we sampled $X \in \mathbb{R}^m$ from a normal distribution with zero mean vector and a covariance matrix of size $m = 4$, generated as follows: (a) sample an $m \times m$ matrix $G$ with independent standard normal distributions, (b) form $\Sigma_{XX} = G G^\top$, (c) for each $j \in \{1, \dots, m\}$, rescale $X_j \in \mathbb{R}$ so that $\Sigma_{X_j X_j} = 1$.

We use the same Gaussian kernel for each variable, $k(x, x') = e^{-(x - x')^2}$; in this situation, as shown in Appendix E, we can compute in closed form the eigenfunctions and eigenvalues of the marginal covariance operators. We then sample functions from random independent components on the first 10 eigenfunctions. Examples are given in Figure 6.

In Figure |71 we plot the regulariza tion paths corresponding to 1000 i.i.d. samples, computed 
by the algorithm of iBach et al. I (l2004bh . We only plot the values of the estimated variables fjj , j = 



1, . . . , m for the alternative formulation in Section 12771 which are proportional to \\wj\\ and normal- 
ized so that X^Jli Vj = 1- We compare them to the population values rjj-. both in terms of values, 
and in terms of their sparsity pattern {rjj is zero for the weights which are equal to zero). Figure |7] 
illustrates several of our theoretical results: (a) the top row con^esponds to a situation where the 
strict consistency condition is satisfied and thus we obtain model consistent estimates with also a 
good estimation of the loading vectors (in the figure, only the good behavior of the norms of these 
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-5 5 -5 5 -5 5 -5 5 

Figure 6: Functions to be estimated in the synthetic non parametric group Lasso experiments (left: 
consistent case, right: inconsistent case). 




Figure 7: Regularization paths for the group Lasso for two weighting schemes (left: non adaptive, right: adaptive) and two different population densities (top: strict consistency condition satisfied, bottom: weak condition not satisfied). For each of the plots, plain curves correspond to values of estimated $\hat\eta_j$, dotted curves to population values $\eta_j$, and bold curves to model consistent estimates.
loading vectors are represented); (b) in the bottom row, the consistency condition was not satisfied, and we do not get good model estimates. Finally, (c) the right column corresponds to the adaptive weighting schemes, which always achieve the two types of consistency. However, such schemes should be used with care, as there is one added free parameter (the regularization parameter $\kappa$ of the least-square estimate used to define the weights): if chosen too large, all adaptive weights are equal, and thus there is no adaptation, while if chosen too small, the least-square estimate may overfit.



6. Conclusion 



In this paper, we have extended some of the theoretical results of the Lasso to the group Lasso, for 
finite dimensional groups and infinite dimensional groups. In particular, under practical assumptions
regarding the distributions the data are sampled from, we have provided necessary and sufficient 
conditions for model consistency of the group Lasso and its nonparametric version, multiple kernel 
learning. 

The current work could be extended in several ways: first, a more detailed study of the limiting distributions of the group Lasso and adaptive group Lasso estimators could be carried out, extending the analysis of Zou (2006), Juditsky and Nemirovski (2000) and Wu et al. (2007), in particular regarding convergence rates. Second, our results should extend to generalized linear models, such as logistic regression (Meier et al., 2006). Also, it is of interest to let the number $m$ of groups or kernels grow unbounded and extend the results of Zhao and Yu (2006) and Meinshausen and Yu (2006) to the group Lasso. Finally, similar analysis may be carried through for more general norms with different sparsity inducing properties (Bach, 2007).



Appendix A. Proof of Optimization Results 

In this appendix, we give detailed proofs of the various propositions on optimality conditions and 
dual problems. 

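A.1 Proof of Proposition 1

We rewrite the problem in Eq. (1) in the form

$$\min_{w \in \mathbb{R}^p,\ v \in \mathbb{R}^m}\ \frac{1}{2}\hat\Sigma_{YY} - \hat\Sigma_{YX} w + \frac{1}{2} w^\top \hat\Sigma_{XX} w + \lambda_n \sum_{j=1}^m d_j v_j,$$

with added constraints that $\forall j$, $\|w_j\| \le v_j$. In order to deal with these constraints we use the tools from conic programming with the second-order cone, also known as the "ice cream" cone (Boyd and Vandenberghe, 2003). We consider the Lagrangian with dual variables $(\beta_j, \gamma_j) \in \mathbb{R}^{p_j} \times \mathbb{R}$ such that $\|\beta_j\| \le \gamma_j$:

$$\mathcal{L}(w, v, \beta, \gamma) = \frac{1}{2}\hat\Sigma_{YY} - \hat\Sigma_{YX} w + \frac{1}{2} w^\top \hat\Sigma_{XX} w + \lambda_n d^\top v - \sum_{j=1}^m \left( \beta_j^\top w_j + \gamma_j v_j \right).$$

The derivatives with respect to primal variables are

$$\nabla_w \mathcal{L}(w, v, \beta, \gamma) = \hat\Sigma_{XX} w - \hat\Sigma_{XY} - \beta,$$
$$\nabla_v \mathcal{L}(w, v, \beta, \gamma) = \lambda_n d - \gamma.$$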





At optimality, primal and dual variables are completely characterized by $w$ and $\beta$. Since the dual and the primal problems are strictly feasible, strong duality holds and the KKT conditions for the reduced primal/dual variables $(w, \beta)$ are

$$\forall j,\quad \|\beta_j\| \le \lambda_n d_j \quad \text{(dual feasibility)}, \tag{24}$$
$$\forall j,\quad \beta_j = \hat\Sigma_{X_jX} w - \hat\Sigma_{X_jY} \quad \text{(stationarity)}, \tag{25}$$
$$\forall j,\quad \beta_j^\top w_j + \|w_j\| \lambda_n d_j = 0 \quad \text{(complementary slackness)}. \tag{26}$$
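
Conditions (24) to (26) can be checked numerically on a candidate solution. The sketch below is only an illustration under stated assumptions: the solver is a generic proximal gradient method with block soft-thresholding (not the algorithm used in this paper), and the names `fit_group_lasso` and `kkt_residuals`, the synthetic data and the value of `lam` are all hypothetical.

```python
import numpy as np

def fit_group_lasso(S_xx, S_xy, groups, d, lam, n_iter=2000):
    """Generic proximal gradient solver (an assumption, not the paper's method)."""
    step = 1.0 / np.linalg.eigvalsh(S_xx)[-1]    # 1 / largest eigenvalue
    w = np.zeros(S_xx.shape[0])
    for _ in range(n_iter):
        z = w - step * (S_xx @ w - S_xy)         # gradient step on the quadratic part
        for idx, dj in zip(groups, d):
            norm = np.linalg.norm(z[idx])
            scale = max(0.0, 1.0 - step * lam * dj / norm) if norm > 0 else 0.0
            w[idx] = scale * z[idx]              # block soft-thresholding
    return w

def kkt_residuals(w, S_xx, S_xy, groups, d, lam):
    """Largest violations of dual feasibility (24) and complementary slackness (26)."""
    beta = S_xx @ w - S_xy                       # stationarity (25) defines beta
    feas, slack = 0.0, 0.0
    for idx, dj in zip(groups, d):
        feas = max(feas, np.linalg.norm(beta[idx]) - lam * dj)
        slack = max(slack, abs(beta[idx] @ w[idx] + np.linalg.norm(w[idx]) * lam * dj))
    return feas, slack

rng = np.random.default_rng(0)
n, lam = 200, 0.2
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
d = [1.0, 1.0, 1.0]
X = rng.standard_normal((n, 6))
y = X @ np.array([1.0, -1.0, 0.0, 0.0, 0.5, 0.5]) + 0.1 * rng.standard_normal(n)
Xc, yc = X - X.mean(axis=0), y - y.mean()        # centering removes the intercept b
S_xx, S_xy = Xc.T @ Xc / n, Xc.T @ yc / n        # empirical covariances
w_hat = fit_group_lasso(S_xx, S_xy, groups, d, lam)
feas, slack = kkt_residuals(w_hat, S_xx, S_xy, groups, d, lam)
```

Both residuals should be close to zero at a minimizer, up to numerical precision; a clearly positive value of either one indicates that the candidate is not optimal.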



Complementary slackness for the second-order cone has special consequences: $w_j^\top \beta_j + \|w_j\| \lambda_n d_j = 0$ if and only if (Boyd and Vandenberghe, 2003; Lobo et al., 1998) either (a) $w_j = 0$, or (b) $w_j \neq 0$, $\|\beta_j\| = \lambda_n d_j$ and $\exists \eta_j > 0$ such that $w_j = -\frac{\eta_j}{\lambda_n}\beta_j$ (anti-proportionality), which implies $\beta_j = -w_j \lambda_n / \eta_j$ and $\eta_j = \|w_j\|/d_j$. This leads to the proposition.
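A.2 Proof of Proposition 8

We follow the proof of Proposition 1 and of Bach et al. (2004a). We rewrite the problem in Eq. (12) in the form

$$\min_{w,\, v,\, t}\ \frac{1}{2}\hat\Sigma_{YY} - \hat\Sigma_{YX} w + \frac{1}{2} w^\top \hat\Sigma_{XX} w + \frac{1}{2}\mu_n t^2,$$

with constraints that $\forall j$, $\|w_j\| \le v_j$ and $d^\top v \le t$. We consider the Lagrangian with dual variables $(\beta_j, \gamma_j) \in \mathbb{R}^{p_j} \times \mathbb{R}$ and $\delta \in \mathbb{R}_+$ such that $\|\beta_j\| \le \gamma_j$, $j = 1, \dots, m$:

$$\mathcal{L}(w, v, t, \beta, \gamma, \delta) = \frac{1}{2}\hat\Sigma_{YY} - \hat\Sigma_{YX} w + \frac{1}{2} w^\top \hat\Sigma_{XX} w + \frac{1}{2}\mu_n t^2 - \beta^\top w - \gamma^\top v + \delta (d^\top v - t).$$

The derivatives with respect to primal variables are

$$\nabla_w \mathcal{L} = \hat\Sigma_{XX} w - \hat\Sigma_{XY} - \beta,$$
$$\nabla_v \mathcal{L} = \delta d - \gamma,$$
$$\nabla_t \mathcal{L} = \mu_n t - \delta.$$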

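At optimality, primal and dual variables are completely characterized by $w$ and $\beta$. Since the dual and the primal problems are strictly feasible, strong duality holds and the KKT conditions for the reduced primal/dual variables $(w, \beta)$ are

$$\forall j,\quad \beta_j = \hat\Sigma_{X_jX} w - \hat\Sigma_{X_jY} \quad \text{(stationarity - 1)}, \tag{27}$$
$$\sum_{j=1}^m d_j \|w_j\| = \frac{1}{\mu_n} \max_{i=1,\dots,m} \frac{\|\beta_i\|}{d_i} \quad \text{(stationarity - 2)}, \tag{28}$$
$$\forall j,\quad \left(\frac{\beta_j}{d_j}\right)^\top w_j + \|w_j\| \max_{i=1,\dots,m} \frac{\|\beta_i\|}{d_i} = 0 \quad \text{(complementary slackness)}. \tag{29}$$

Complementary slackness for the second-order cone implies that

$$\left(\frac{\beta_j}{d_j}\right)^\top w_j + \|w_j\| \max_{i=1,\dots,m} \frac{\|\beta_i\|}{d_i} = 0$$

if and only if either (a) $w_j = 0$, or (b) $w_j \neq 0$ and $\frac{\|\beta_j\|}{d_j} = \max_{i=1,\dots,m} \frac{\|\beta_i\|}{d_i}$, and $\exists \eta_j \ge 0$ such that $w_j = -\eta_j \beta_j / \mu_n$, which implies $\|w_j\| = \frac{\eta_j d_j}{\mu_n} \max_{i=1,\dots,m} \frac{\|\beta_i\|}{d_i}$.

By writing $\eta_j = 0$ if $w_j = 0$ (i.e., in order to cover all cases), we have from Eq. (28) $\sum_{j=1}^m d_j \|w_j\| = \frac{1}{\mu_n} \max_{i=1,\dots,m} \frac{\|\beta_i\|}{d_i}$, which implies $\sum_{j=1}^m d_j^2 \eta_j = 1$ and thus $\forall j$, $\eta_j = \frac{\|w_j\|/d_j}{\sum_{i=1}^m d_i \|w_i\|}$. This leads to $\forall j$, $\beta_j = -w_j \mu_n / \eta_j = -\frac{w_j d_j}{\|w_j\|}\, \mu_n \sum_{i=1}^m d_i \|w_i\|$. The proposition follows.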
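
The closed form for $\eta_j$ obtained above, and the normalization $\sum_j d_j^2 \eta_j = 1$, can be verified on a toy example. This is a minimal sketch; the helper name `eta_from_w` and the numerical values are assumptions chosen for illustration.

```python
import numpy as np

def eta_from_w(w_groups, d):
    """eta_j = (||w_j|| / d_j) / sum_i d_i ||w_i||; needs at least one nonzero group."""
    norms = np.array([np.linalg.norm(wj) for wj in w_groups])
    d = np.asarray(d, dtype=float)
    return norms / (d * np.sum(d * norms))

w_groups = [np.array([3.0, 4.0]), np.zeros(2), np.array([1.0])]  # toy loading vectors
d = [1.0, 2.0, 0.5]                                              # toy weights
eta = eta_from_w(w_groups, d)
normalization = float(np.sum(np.square(d) * eta))                # should equal 1
```

Groups with $w_j = 0$ receive $\eta_j = 0$, so the sparsity pattern of $\eta$ matches that of $w$.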

A.3 Proof of Proposition 10

By following the usual proof of the representer theorem (Wahba, 1990), we obtain that each optimal function $f_j$ must be supported by the data points, i.e., there exists $\alpha = (\alpha_1, \dots, \alpha_m) \in \mathbb{R}^{n \times m}$ such that for all $j = 1, \dots, m$, $f_j = \sum_{i=1}^n \alpha_{ij} k_j(\cdot, x_{ij})$. When using this representation back into Eq. (15), we obtain an optimization problem that only depends on $\phi_j = G_j \alpha_j$ for $j = 1, \dots, m$, where $G_j$ denotes any square root of the kernel matrix $K_j$, i.e., $K_j = G_j G_j^\top$. This problem is exactly the finite dimensional problem in Eq. (12), where $X_j$ is replaced by $G_j$ and $w_j$ by $\phi_j$. Thus Proposition 8 applies and we can easily derive the current proposition by expressing all terms through the functions $f_j$. Note that in this proposition, we do not show that the $\alpha_j$, $j = 1, \dots, m$, are all proportional to the same vector, as is done in Appendix A.4.
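
A square root $G_j$ of a kernel matrix can be obtained, for instance, from an eigendecomposition. The sketch below assumes the Gaussian kernel $k(x,x') = e^{-(x-x')^2}$ of the simulations and arbitrary sample points; the function names are hypothetical, and the last lines check that the RKHS norm $\alpha^\top K \alpha$ equals a squared Euclidean norm after the change of variable.

```python
import numpy as np

def gaussian_kernel_matrix(x):
    """Kernel matrix for k(x, x') = exp(-(x - x')^2) on one-dimensional inputs."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2)

def kernel_square_root(K):
    """One possible square root: G = V diag(sqrt(lambda)), so that G G^T = K."""
    evals, evecs = np.linalg.eigh(K)
    evals = np.clip(evals, 0.0, None)        # remove tiny negative round-off
    return evecs * np.sqrt(evals)

x = np.linspace(-2.0, 2.0, 8)                # arbitrary sample points
K = gaussian_kernel_matrix(x)
G = kernel_square_root(K)

alpha = np.random.default_rng(0).standard_normal(8)
rkhs_norm_sq = float(alpha @ K @ alpha)      # ||f||^2 for f = sum_i alpha_i k(., x_i)
euclid_norm_sq = float(np.sum((G.T @ alpha) ** 2))
```

Any other factorization with $K = GG^\top$ (for example a Cholesky factor when $K$ is positive definite) would serve the same purpose.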

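A.4 Proof of Proposition 13

We prove the proposition in the linear case. Going to the general case can be done in the same way as in Appendix A.3. We let $X$ denote the covariate matrix in $\mathbb{R}^{n \times p}$; we simply need to add a new variable $u = Xw + b 1_n$ and to "dualize" it. That is, we rewrite the problem in Eq. (12) in the form

$$\min_{w,\, b,\, v,\, t,\, u}\ \frac{1}{2n}\|Y - u\|^2 + \frac{1}{2}\mu_n t^2,$$

with constraints that $\forall j$, $\|w_j\| \le v_j$, $d^\top v \le t$ and $Xw + b 1_n = u$. We consider the Lagrangian with dual variables $(\beta_j, \gamma_j) \in \mathbb{R}^{p_j} \times \mathbb{R}$ and $\delta \in \mathbb{R}_+$ such that $\|\beta_j\| \le \gamma_j$, and $\alpha \in \mathbb{R}^n$:

$$\mathcal{L}(w, b, v, u, \beta, \gamma, \alpha, \delta) = \frac{1}{2n}\|Y - u\|^2 + \mu_n \alpha^\top (u - Xw) + \frac{1}{2}\mu_n t^2 - \sum_{j=1}^m \left( \beta_j^\top w_j + \gamma_j v_j \right) + \delta (d^\top v - t).$$

The derivatives with respect to primal variables are

$$\nabla_w \mathcal{L} = -\mu_n X^\top \alpha - \beta,$$
$$\nabla_v \mathcal{L} = \delta d - \gamma,$$
$$\nabla_t \mathcal{L} = \mu_n t - \delta,$$
$$\nabla_u \mathcal{L} = \frac{1}{n}\left( u - Y + \mu_n n \alpha \right).$$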

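Equating them to zero, we get the dual problem in Eq. (20). Since the dual and the primal problems are strictly feasible, strong duality holds and the KKT conditions for the reduced primal/dual variables $(w, \alpha)$ are

$$Xw - Y + \mu_n n \alpha = 0 \quad \text{(stationarity - 1)}, \tag{30}$$
$$\sum_{j=1}^m d_j \|w_j\| = \max_{i=1,\dots,m} \frac{(\alpha^\top K_i \alpha)^{1/2}}{d_i} \quad \text{(stationarity - 2)}, \tag{31}$$
$$\alpha^\top 1_n = 0 \quad \text{(stationarity - 3)}, \tag{32}$$
$$\forall j,\quad \left(\frac{-X_j^\top \alpha}{d_j}\right)^\top w_j + \|w_j\| \max_{i=1,\dots,m} \frac{(\alpha^\top K_i \alpha)^{1/2}}{d_i} = 0 \quad \text{(complementary slackness)}. \tag{33}$$

Complementary slackness for the second-order cone leads to:

$$\left(\frac{-X_j^\top \alpha}{d_j}\right)^\top w_j + \|w_j\| \max_{i=1,\dots,m} \frac{(\alpha^\top K_i \alpha)^{1/2}}{d_i} = 0$$

if and only if either (a) $w_j = 0$, or (b) $w_j \neq 0$ and $\frac{(\alpha^\top K_j \alpha)^{1/2}}{d_j} = \max_{i=1,\dots,m} \frac{(\alpha^\top K_i \alpha)^{1/2}}{d_i}$, and $\exists \eta_j \ge 0$ such that $w_j = -\eta_j \left(-X_j^\top \alpha\right)$, which implies $\|w_j\| = \eta_j d_j \max_{i=1,\dots,m} \frac{(\alpha^\top K_i \alpha)^{1/2}}{d_i}$.

By writing $\eta_j = 0$ if $w_j = 0$ (to cover all cases), we have from Eq. (31) $\sum_{j=1}^m d_j \|w_j\| = \max_{i=1,\dots,m} \frac{(\alpha^\top K_i \alpha)^{1/2}}{d_i}$, which implies $\sum_{j=1}^m d_j^2 \eta_j = 1$. The proposition follows from the fact that at optimality, $\forall j$, $w_j = \eta_j X_j^\top \alpha$.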
A.5 Proof of Proposition 14

What makes this proposition non obvious is the fact that the covariance operator $\Sigma_{XX}$ is not invertible in general. From Proposition 13, we know that each $\hat f_j$ must be of the form $\hat f_j = \eta_j \sum_{i=1}^n \alpha_i k_j(x_{ij}, \cdot)$, where $\alpha$ is uniquely defined. Moreover, $\eta$ is such that $\left(\sum_{j=1}^m \eta_j K_j + n\mu_n I_n\right)\alpha = Y$ and such that if $\frac{\alpha^\top K_j \alpha}{d_j^2} < A$, then $\eta_j = 0$ (where $A = \max_{j=1,\dots,m} \frac{\alpha^\top K_j \alpha}{d_j^2}$). Thus, if the solution is not unique, there exist two vectors $\eta \neq \zeta$ such that $\eta$ and $\zeta$ have zero components on indices $j$ such that $\alpha^\top K_j \alpha < A d_j^2$ (we let $J$ denote the active set and thus $J^c$ this set of indices), and $\sum_{j=1}^m (\zeta_j - \eta_j)\Pi_n K_j \alpha = 0$. This implies that the vectors $\Pi_n K_j \alpha = \Pi_n K_j \Pi_n \alpha$, $j \in J$, are linearly dependent. Those vectors are exactly the centered vectors of values of the functions $g_j = \sum_{i=1}^n \alpha_i k_j(x_{ij}, \cdot)$ at the observed data points. Thus, non unicity implies that the empirical covariance matrix of the random variables $g_j(X_j)$, $j \in J$, is non invertible. Moreover, we have $\|g_j\|_{\mathcal{F}_j}^2 = \alpha^\top K_j \alpha = d_j^2 A > 0$ and the empirical marginal variance of $g_j(X_j)$ is equal to $\frac{1}{n}\alpha^\top K_j^2 \alpha > 0$ (otherwise $\|g_j\|_{\mathcal{F}_j}^2 = 0$). By normalizing by the (non vanishing) empirical standard deviations, we thus obtain functions such that the empirical covariance matrix is singular, but the marginal empirical variances are equal to one. Because the empirical covariance operator is a consistent estimator of $\Sigma_{XX}$ and $C_{XX}$ is invertible, we get a contradiction, which proves the unicity of solutions.
Appendix B. Detailed Proofs for the Group Lasso

In this appendix, detailed proofs of the consistency results for the finite dimensional case (Theorems 2 and 3) are presented. Some of the results presented in this appendix are corollaries of the more general results in Appendix C, but their proofs in the finite dimensional case are much simpler.

B.1 Proof of Theorem 2

We begin with a lemma, which states that if we restrict ourselves to the covariates which we are after (i.e., indexed by $J$), we get a consistent estimate as soon as $\lambda_n$ tends to zero:

Lemma 20 Assume (A1-3). Let $\tilde w_J$ be any minimizer of

$$\frac{1}{2}\hat\Sigma_{YY} - \hat\Sigma_{YX_J} w_J + \frac{1}{2} w_J^\top \hat\Sigma_{X_JX_J} w_J + \lambda_n \sum_{j \in J} d_j \|w_j\|.$$

If $\lambda_n \to 0$, then $\tilde w_J$ converges to $\mathbf{w}_J$ in probability.

Proof If $\lambda_n$ tends to zero, then the cost function defining $\tilde w_J$ converges to $F(w_J) = \frac{1}{2}\Sigma_{YY} - \Sigma_{YX_J} w_J + \frac{1}{2} w_J^\top \Sigma_{X_JX_J} w_J$, whose unique (because $\Sigma_{X_JX_J}$ is positive definite) global minimum is $\mathbf{w}_J$ (true generating value). The convergence of $\tilde w_J$ is thus a simple consequence of standard results in M-estimation (Van der Vaart, 1998; Fu and Knight, 2000). ■



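We now prove Theorem 2. Let $\tilde w_J$ be defined as in Lemma 20. We extend it by zeros on $J^c$. We already know from Lemma 20 that we have consistency in squared norm. Since with probability tending to one, the problem has a unique solution (because $\Sigma_{XX}$ is invertible), we now need to prove that the probability that $\tilde w$ is optimal for the problem in Eq. (1) is tending to one.

By definition of $\tilde w_J$, the optimality condition of Proposition 1 for the groups in $J$ (the equality condition) is satisfied. We now need to verify the optimality condition for the groups in $J^c$ (the inequality condition). Denoting $\varepsilon = Y - \mathbf{w}^\top X - \mathbf{b}$, we have:

$$\hat\Sigma_{XY} = \hat\Sigma_{XX}\mathbf{w} + \hat\Sigma_{X\varepsilon} = \left(\Sigma_{XX} + O_p(n^{-1/2})\right)\mathbf{w} + O_p(n^{-1/2}) = \Sigma_{XX_J}\mathbf{w}_J + O_p(n^{-1/2}),$$

because of classical results on convergence of empirical covariances to covariances (Van der Vaart, 1998), which are applicable because we have the fourth order moment condition (A1). We thus have:

$$\hat\Sigma_{XY} - \hat\Sigma_{XX_J}\tilde w_J = \Sigma_{XX_J}(\mathbf{w}_J - \tilde w_J) + O_p(n^{-1/2}). \tag{34}$$

From the optimality condition $\hat\Sigma_{X_JY} - \hat\Sigma_{X_JX_J}\tilde w_J = \lambda_n \operatorname{Diag}(d_j/\|\tilde w_j\|)\tilde w_J$ defining $\tilde w_J$ and Eq. (34), we obtain:

$$\tilde w_J - \mathbf{w}_J = -\lambda_n \Sigma_{X_JX_J}^{-1} \operatorname{Diag}(d_j/\|\tilde w_j\|)\tilde w_J + O_p(n^{-1/2}). \tag{35}$$

Therefore,

$$\hat\Sigma_{X_{J^c}Y} - \hat\Sigma_{X_{J^c}X_J}\tilde w_J = \Sigma_{X_{J^c}X_J}(\mathbf{w}_J - \tilde w_J) + O_p(n^{-1/2}) \quad \text{by Eq. (34)},$$
$$= \lambda_n \Sigma_{X_{J^c}X_J}\Sigma_{X_JX_J}^{-1} \operatorname{Diag}(d_j/\|\tilde w_j\|)\tilde w_J + O_p(n^{-1/2}) \quad \text{by Eq. (35)}.$$

Since $\tilde w$ is consistent, and $\lambda_n n^{1/2} \to +\infty$, then for each $i \in J^c$,

$$\frac{1}{d_i \lambda_n}\left(\hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde w_J\right)$$

converges in probability to $\frac{1}{d_i}\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1} \operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J$, which is of norm strictly smaller than one because the strict consistency condition is satisfied. Thus the probability that $\tilde w$ is indeed optimal, which is equal to

$$P\left\{ \forall i \in J^c,\ \frac{1}{d_i \lambda_n}\left\| \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde w_J \right\| \le 1 \right\},$$

is tending to 1, which implies the theorem.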
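
The limiting quantity that appears at the end of this proof, $\frac{1}{d_i}\|\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J\|$ for $i \in J^c$, can be evaluated directly for a given population covariance and loading vector. The following is a minimal sketch; the function name `consistency_condition_values` and the two-variable example are assumptions made for illustration.

```python
import numpy as np

def consistency_condition_values(Sigma, w, groups, d):
    """Return {i: (1/d_i) ||Sigma_{X_i X_J} Sigma_{X_J X_J}^{-1} Diag(d_j/||w_j||) w_J||}
    for every inactive group i (groups with w_i = 0)."""
    norms = [np.linalg.norm(w[idx]) for idx in groups]
    active = [j for j, nj in enumerate(norms) if nj > 0]
    inactive = [j for j, nj in enumerate(norms) if nj == 0]
    J = np.concatenate([groups[j] for j in active])
    u = np.concatenate([d[j] * w[groups[j]] / norms[j] for j in active])
    t = np.linalg.solve(Sigma[np.ix_(J, J)], u)
    return {i: np.linalg.norm(Sigma[np.ix_(groups[i], J)] @ t) / d[i] for i in inactive}

# Toy example: two unit-variance variables with correlation 0.5, only the first is relevant.
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
w = np.array([1.0, 0.0])
groups = [np.array([0]), np.array([1])]
values = consistency_condition_values(Sigma, w, groups, [1.0, 1.0])
strict_condition_holds = max(values.values()) < 1.0
```

In this toy case the value for the irrelevant variable is the absolute correlation, so the strict condition holds whenever the two variables are not perfectly correlated.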



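B.2 Proof of Theorem 3

We prove the theorem by contradiction, by assuming that there exists $i \in J^c$ such that

$$\frac{1}{d_i}\left\| \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J \right\| > 1.$$

Since with probability tending to one $J(\hat w) = J$, with probability tending to one, we have from the optimality condition on the active groups

$$\hat w_J = \hat\Sigma_{X_JX_J}^{-1}\left[ \hat\Sigma_{X_JY} - \lambda_n \operatorname{Diag}(d_j/\|\hat w_j\|)\hat w_J \right],$$

and thus

$$\hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\hat w_J = \left( \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\hat\Sigma_{X_JX_J}^{-1}\hat\Sigma_{X_JY} \right) + \lambda_n \hat\Sigma_{X_iX_J}\hat\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\hat w_j\|)\hat w_J = A_n + B_n.$$

The second term $B_n$ in the last expression (divided by $\lambda_n$) converges to

$$v = \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J \in \mathbb{R}^{p_i},$$

because $\hat w$ is assumed to converge in probability to $\mathbf{w}$ and empirical covariance matrices converge to population covariance matrices. By assumption $\|v\| > d_i$, which implies that the probability $P\left\{ \left(\frac{v}{\|v\|}\right)^\top (B_n/\lambda_n) \ge (d_i + \|v\|)/2 \right\}$ converges to one.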
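The first term is equal to (with $\varepsilon_k = y_k - \mathbf{w}^\top x_k - \mathbf{b}$ and $\bar\varepsilon = \frac{1}{n}\sum_{k=1}^n \varepsilon_k$):

$$A_n = \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\hat\Sigma_{X_JX_J}^{-1}\hat\Sigma_{X_JY}$$
$$= \hat\Sigma_{X_iX_J}\mathbf{w}_J - \hat\Sigma_{X_iX_J}\hat\Sigma_{X_JX_J}^{-1}\hat\Sigma_{X_JX_J}\mathbf{w}_J + \hat\Sigma_{X_i\varepsilon} - \hat\Sigma_{X_iX_J}\hat\Sigma_{X_JX_J}^{-1}\hat\Sigma_{X_J\varepsilon}$$
$$= \hat\Sigma_{X_i\varepsilon} - \hat\Sigma_{X_iX_J}\hat\Sigma_{X_JX_J}^{-1}\hat\Sigma_{X_J\varepsilon}$$
$$= \hat\Sigma_{X_i\varepsilon} - \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\hat\Sigma_{X_J\varepsilon} + o_p(n^{-1/2})$$
$$= \frac{1}{n}\sum_{k=1}^n (\varepsilon_k - \bar\varepsilon)\left( x_{ki} - \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}x_{kJ} \right) + o_p(n^{-1/2}) = C_n + o_p(n^{-1/2}).$$

The random variable $C_n$ is a U-statistic with square integrable kernel obtained from i.i.d. random vectors; it is thus asymptotically normal (Van der Vaart, 1998). We thus simply need to compute the mean and the variance of $C_n$. We have $E C_n = 0$ because $E(X\varepsilon) = \Sigma_{X\varepsilon} = 0$. We let $D_k = x_{ki} - \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}x_{kJ} - \frac{1}{n}\sum_{k'=1}^n \left( x_{k'i} - \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}x_{k'J} \right)$. We have:

$$\operatorname{var}(C_n) = E\, C_n C_n^\top = E\left( E(C_n C_n^\top \mid X) \right) = E\left[ \frac{1}{n^2}\sum_{k=1}^n E(\varepsilon_k^2 \mid X)\, D_k D_k^\top \right]$$
$$\succcurlyeq E\left[ \frac{1}{n^2}\sum_{k=1}^n \sigma_{\min}^2 D_k D_k^\top \right] = \frac{n-1}{n^2}\,\sigma_{\min}^2 \left( \Sigma_{X_iX_i} - \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\Sigma_{X_JX_i} \right),$$

where $M \succcurlyeq N$ denotes the partial order between symmetric matrices (i.e., equivalent to $M - N$ positive semidefinite).

Thus $n^{1/2} C_n$ is asymptotically normal with mean $0$ and covariance matrix larger than $\sigma_{\min}^2 \Sigma_{X_i|X_J} = \sigma_{\min}^2 \left( \Sigma_{X_iX_i} - \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\Sigma_{X_JX_i} \right)$, which is positive definite (because this is the conditional covariance of $X_i$ given $X_J$ and $\Sigma_{XX}$ is assumed invertible). Therefore $P(n^{1/2} v^\top A_n > 0)$ converges to a constant $a \in (0, 1)$, which implies that $P\left\{ \left(\frac{v}{\|v\|}\right)^\top (A_n + B_n)/\lambda_n \ge (d_i + \|v\|)/2 \right\}$ is asymptotically bounded below by $a$. Thus, since $\|(A_n + B_n)/\lambda_n\| \ge \left(\frac{v}{\|v\|}\right)^\top (A_n + B_n)/\lambda_n \ge (d_i + \|v\|)/2 > d_i$ implies that $\hat w$ is not optimal, we get a contradiction, which concludes the proof.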

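B.3 Proof of Theorem 4

We first prove the following refinement of Lemma 20:

Lemma 21 Assume (A1-3). Let $\tilde w_J$ be any minimizer of

$$\frac{1}{2}\hat\Sigma_{YY} - \hat\Sigma_{YX_J} w_J + \frac{1}{2} w_J^\top \hat\Sigma_{X_JX_J} w_J + \lambda_n \sum_{j \in J} d_j \|w_j\|.$$

If $\lambda_n \to 0$ and $\lambda_n n^{1/2} \to \infty$, then $\frac{1}{\lambda_n}(\tilde w_J - \mathbf{w}_J)$ converges in probability to

$$\Delta = -\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J.$$

Proof We follow Fu and Knight (2000) and write $\tilde w_J = \mathbf{w}_J + \lambda_n \tilde\Delta$. The vector $\tilde\Delta$ is the minimizer of the following function:

$$F(\Delta) = -\hat\Sigma_{YX_J}(\mathbf{w}_J + \lambda_n\Delta) + \frac{1}{2}(\mathbf{w}_J + \lambda_n\Delta)^\top \hat\Sigma_{X_JX_J}(\mathbf{w}_J + \lambda_n\Delta) + \lambda_n \sum_{j \in J} d_j \|\mathbf{w}_j + \lambda_n\Delta_j\|$$
$$= -\lambda_n \hat\Sigma_{YX_J}\Delta + \frac{\lambda_n^2}{2}\Delta^\top \hat\Sigma_{X_JX_J}\Delta + \lambda_n \mathbf{w}_J^\top \hat\Sigma_{X_JX_J}\Delta + \lambda_n \sum_{j \in J} d_j \left( \|\mathbf{w}_j + \lambda_n\Delta_j\| - \|\mathbf{w}_j\| \right) + \text{cst}$$
$$= -\lambda_n \hat\Sigma_{\varepsilon X_J}\Delta + \frac{\lambda_n^2}{2}\Delta^\top \hat\Sigma_{X_JX_J}\Delta + \lambda_n \sum_{j \in J} d_j \left( \|\mathbf{w}_j + \lambda_n\Delta_j\| - \|\mathbf{w}_j\| \right) + \text{cst},$$

by using $\hat\Sigma_{YX_J} = \mathbf{w}_J^\top \hat\Sigma_{X_JX_J} + \hat\Sigma_{\varepsilon X_J}$. The first term is $O_p(n^{-1/2}\lambda_n) = o_p(\lambda_n^2)$, while the last ones are equal to $\|\mathbf{w}_j + \lambda_n\Delta_j\| - \|\mathbf{w}_j\| = \lambda_n \left(\frac{\mathbf{w}_j}{\|\mathbf{w}_j\|}\right)^\top \Delta_j + o_p(\lambda_n)$. Thus,

$$F(\Delta)/\lambda_n^2 = \frac{1}{2}\Delta^\top \Sigma_{X_JX_J}\Delta + \sum_{j \in J} d_j \left(\frac{\mathbf{w}_j}{\|\mathbf{w}_j\|}\right)^\top \Delta_j + o_p(1).$$

By Lemma 20, $\tilde w_J$ is $O_p(1)$ and the limiting function has a unique minimum; standard results in M-estimation (Van der Vaart, 1998) show that $\tilde\Delta$ converges in probability to the minimum of the last expression, which is exactly $\Delta = -\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J$. ■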



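We now turn to the proof of Theorem 4. We follow the proof of Theorem 2. Given $\tilde w$ defined through Lemmas 20 and 21, we need to satisfy the optimality condition on the inactive groups for all $i \in J^c$, with probability tending to one. For all those $i$ such that $\frac{1}{d_i}\left\|\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J\right\| < 1$, we know from Appendix B.1 that the optimality condition is indeed satisfied with probability tending to one. We now focus on those $i$ such that $\frac{1}{d_i}\left\|\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J\right\| = 1$, and for which we have the condition in Eq. (6). From Eq. (35) and the few arguments that follow, we get that for all $i \in J^c$,

$$\hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde w_J = \lambda_n \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\tilde w_j\|)\tilde w_J + O_p(n^{-1/2}). \tag{36}$$

Moreover, we have from Lemma 21 and standard differential calculus (i.e., the gradient and the Hessian of the function $v \in \mathbb{R}^q \mapsto \|v\| \in \mathbb{R}$ are $v/\|v\|$ and $\frac{1}{\|v\|}\left( I_q - \frac{vv^\top}{v^\top v} \right)$):

$$\frac{\tilde w_j}{\|\tilde w_j\|} = \frac{\mathbf{w}_j}{\|\mathbf{w}_j\|} + \frac{\lambda_n}{\|\mathbf{w}_j\|}\left( I_{p_j} - \frac{\mathbf{w}_j\mathbf{w}_j^\top}{\mathbf{w}_j^\top\mathbf{w}_j} \right)\Delta_j + o_p(\lambda_n). \tag{37}$$

From Eq. (36) and Eq. (37), we get:

$$\frac{1}{\lambda_n}\left( \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde w_J \right) = O_p(n^{-1/2}\lambda_n^{-1}) + \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J$$
$$\qquad + \lambda_n \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}\left[ \frac{d_j}{\|\mathbf{w}_j\|}\left( I_{p_j} - \frac{\mathbf{w}_j\mathbf{w}_j^\top}{\mathbf{w}_j^\top\mathbf{w}_j} \right) \right]\Delta + o_p(\lambda_n)$$
$$= A + \lambda_n B + o_p(\lambda_n) + O_p(n^{-1/2}\lambda_n^{-1}).$$

Since $\lambda_n \gg n^{-1/4}$, we have $O_p(n^{-1/2}\lambda_n^{-1}) = o_p(\lambda_n)$. Thus, since we assumed that $\|A\| = \left\|\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J\right\| = d_i$, we have:

$$\left\| \frac{1}{\lambda_n}\left( \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde w_J \right) \right\|^2 = \|A\|^2 + 2\lambda_n A^\top B + o_p(\lambda_n)$$
$$= d_i^2 + o_p(\lambda_n) - 2\lambda_n \Delta^\top \Sigma_{X_JX_i}\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}\left[ \frac{d_j}{\|\mathbf{w}_j\|}\left( I_{p_j} - \frac{\mathbf{w}_j\mathbf{w}_j^\top}{\mathbf{w}_j^\top\mathbf{w}_j} \right) \right]\Delta$$

(note that we have $A = -\Sigma_{X_iX_J}\Delta$), which is asymptotically strictly smaller than $d_i^2$ if Eq. (6) is satisfied, which proves optimality and concludes the proof.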
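B.4 Proof of Proposition 6

As in the proof of Theorem 2 in Appendix B.1, we consider the estimate $\tilde w$ built from the reduced problem by constraining $\tilde w_{J^c} = 0$. We consider the following event:

$$E_1 = \{ \hat\Sigma_{XX} \text{ invertible and } \forall j \in J,\ \tilde w_j \neq 0 \}.$$

This event has a probability converging to one. Moreover, if $E_1$ is true, then the group Lasso estimate has the correct sparsity pattern if and only if for all $i \in J^c$,

$$\left\| \hat\Sigma_{X_iX_J}(\tilde w_J - \mathbf{w}_J) - \hat\Sigma_{X_i\varepsilon} \right\| \le \lambda_n d_i, \quad \text{with } \lambda_n = \lambda_0 n^{-1/2}.$$

Moreover we have by definition of $\tilde w_J$: $\hat\Sigma_{X_JX_J}(\tilde w_J - \mathbf{w}_J) - \hat\Sigma_{X_J\varepsilon} = -\lambda_n \operatorname{Diag}(d_j/\|\tilde w_j\|)\tilde w_J$, and thus we get:

$$\hat\Sigma_{X_iX_J}(\tilde w_J - \mathbf{w}_J) - \hat\Sigma_{X_i\varepsilon} = \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\hat\Sigma_{X_J\varepsilon} - \hat\Sigma_{X_i\varepsilon} - \lambda_0 n^{-1/2}\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J + O_p(n^{-1}).$$

The random vector $\hat\Sigma_{X\varepsilon} \in \mathbb{R}^p$ is a multivariate U-statistic with square integrable kernel obtained from i.i.d. random vectors; it is thus asymptotically normal (Van der Vaart, 1998) and we simply need to compute its mean and variance. The mean is zero, and the variance is $\frac{n-1}{n^2}\sigma^2\Sigma_{XX} = n^{-1}\sigma^2\Sigma_{XX} + o(n^{-1})$. This implies that the random vector $s$ of size $\operatorname{Card}(J^c)$ defined by

$$s_i = n^{1/2}\left\| \hat\Sigma_{X_iX_J}(\tilde w_J - \mathbf{w}_J) - \hat\Sigma_{X_i\varepsilon} \right\|$$

is equal to

$$s_i = \left\| \sigma\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}u_J - \sigma u_i - \lambda_0\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J \right\| + O_p(n^{-1/2}) = f_i(u) + O_p(n^{-1/2}),$$

where $u = \sigma^{-1}n^{1/2}\hat\Sigma_{X\varepsilon}$ and the $f_i$ are deterministic continuous functions. The vector $f(u)$ converges in distribution to $f(v)$, where $v$ is normally distributed with mean zero and covariance matrix $\Sigma_{XX}$. By Slutsky's lemma (Van der Vaart, 1998), this implies that the random vector $s$ has the same limiting distribution. Thus, the probability $P(\max_{i \in J^c} s_i/d_i \le \lambda_0)$ converges to

$$P\left( \max_{i \in J^c} \frac{1}{d_i}\left\| \sigma\left( \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}v_J - v_i \right) - \lambda_0\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J \right\| \le \lambda_0 \right).$$

Under the event $E_1$, which has probability tending to one, we have correct pattern selection if and only if $\max_{i \in J^c} s_i/d_i \le \lambda_0$, which leads to

$$P\left( \max_{i \in J^c} \frac{1}{d_i}\left\| \sigma t_i - \lambda_0\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\operatorname{Diag}(d_j/\|\mathbf{w}_j\|)\mathbf{w}_J \right\| \le \lambda_0 \right),$$

where $t_i = \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}v_J - v_i$. The vector $t$ is normally distributed and a short calculation shows that its covariance matrix is equal to $\Sigma_{X_{J^c}X_{J^c}|X_J}$, which concludes the proof.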
Appendix C. Detailed Proofs for the Nonparametric Formulation

We first prove lemmas that will be useful for further proofs, and then prove the consistency results 
for the non parametric case. 

C.l Useful Lemmas on Empirical Covariance Operators 

We first have the following lemma, proved by Fukumizu et al. (2007), which states that the empirical covariance estimator converges in probability at rate $O_p(n^{-1/2})$ to the population covariance operators:

Lemma 22 Assume (A^ and (A^. Then — SxxIIjf = Op{n~^/'^) (for the operator norm), 



The following lemma is useful in several proofs: 
Lemma 23 Assume (A®. Then 



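$$\left\| (\hat\Sigma_{XX} + \mu_n I)^{-1}\Sigma_{XX} - (\Sigma_{XX} + \mu_n I)^{-1}\Sigma_{XX} \right\|_{\mathcal{F}} = O_p(n^{-1/2}\mu_n^{-1}),$$

and

$$\left\| (\hat\Sigma_{XX} + \mu_n I)^{-1}\hat\Sigma_{XX} - (\Sigma_{XX} + \mu_n I)^{-1}\Sigma_{XX} \right\|_{\mathcal{F}} = O_p(n^{-1/2}\mu_n^{-1}).$$

Proof We have:

$$(\hat\Sigma_{XX} + \mu_n I)^{-1}\Sigma_{XX} - (\Sigma_{XX} + \mu_n I)^{-1}\Sigma_{XX} = (\hat\Sigma_{XX} + \mu_n I)^{-1}(\Sigma_{XX} - \hat\Sigma_{XX})(\Sigma_{XX} + \mu_n I)^{-1}\Sigma_{XX}.$$

This is the product of operators whose norms are respectively upper bounded by $\mu_n^{-1}$, $O_p(n^{-1/2})$ and $1$, which leads to the first inequality (we use $\|AB\|_{\mathcal{F}} \le \|A\|_{\mathcal{F}}\|B\|_{\mathcal{F}}$). The second inequality follows along similar lines. ■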
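
The operator identity used in this proof, and the resulting bound, can be checked on finite dimensional matrices. The sketch below is an illustration only: the dimension, sample size, value of `mu` and the way the covariance is generated are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, mu = 5, 200, 0.1                           # arbitrary sizes and regularization
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p                              # a population covariance (assumed)
X = A @ rng.standard_normal((p, n)) / np.sqrt(p) # n samples with covariance Sigma
Sigma_hat = X @ X.T / n                          # empirical (uncentered) covariance

I = np.eye(p)
R_hat = np.linalg.inv(Sigma_hat + mu * I)
R = np.linalg.inv(Sigma + mu * I)

lhs = R_hat @ Sigma - R @ Sigma
rhs = R_hat @ (Sigma - Sigma_hat) @ R @ Sigma    # factorization used in the proof

# Norm bound: ||R_hat|| <= 1/mu, ||R Sigma|| <= 1, hence ||lhs|| <= ||Sigma - Sigma_hat|| / mu.
bound = np.linalg.norm(Sigma - Sigma_hat, 2) / mu
lhs_norm = np.linalg.norm(lhs, 2)
```

The same factorization holds for any two positive semidefinite operators, which is why the rate $n^{-1/2}$ of Lemma 22 is simply degraded by the factor $\mu_n^{-1}$.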

Note that the two previous lemmas also hold for any suboperator of $\Sigma_{XX}$, i.e., for $\Sigma_{X_JX_J}$ or $\Sigma_{X_iX_J}$.

-1 /o 

Lemma 24 Assume fA|4]), fAH]) and fAl?]). There exists hj € J-'j such that fj = Sj^^^^^hj. 
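Proof The range condition implies that

$$\mathbf{f}_J = \operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})\,\mathbf{g}_J = \operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})\,C_{X_JX_J}^{1/2}C_{X_JX_J}^{-1/2}\,\mathbf{g}_J$$

(because $C_{XX}$ is invertible). The result follows from the identity

$$\Sigma_{X_JX_J} = \operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})\,C_{X_JX_J}^{1/2}\left( \operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})\,C_{X_JX_J}^{1/2} \right)^*$$

and the fact that if $\Sigma_{X_JX_J} = UU^*$ and $f = U\alpha$ then there exists $\beta$ such that $f = \Sigma_{X_JX_J}^{1/2}\beta$ (Baker, 1973). ■

4. The adjoint operator $V^*$ of $V : \mathcal{F}_i \to \mathcal{F}_j$ is such that for all $f \in \mathcal{F}_i$ and $g \in \mathcal{F}_j$, $\langle f, V^*g \rangle_{\mathcal{F}_i} = \langle Vf, g \rangle_{\mathcal{F}_j}$ (Brezis, 1980).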
C.2 Proof of Theorem 11

We now extend Lemma 20 to covariance operators, which requires the use of the alternative formulation and a slower rate of decrease for the regularization parameter:

Lemma 25 Let $\tilde f_J$ be any minimizer of

$$\frac{1}{2}\hat\Sigma_{YY} - \langle \hat\Sigma_{X_JY}, f_J \rangle_{\mathcal{F}_J} + \frac{1}{2}\langle f_J, \hat\Sigma_{X_JX_J}f_J \rangle_{\mathcal{F}_J} + \frac{\mu_n}{2}\Big( \sum_{j \in J} d_j \|f_j\|_{\mathcal{F}_j} \Big)^2.$$

If $\mu_n \to 0$ and $\mu_n n^{1/2} \to +\infty$, then $\|\tilde f_J - \mathbf{f}_J\|_{\mathcal{F}_J}$ converges to zero in probability. Moreover, for any $\eta_n$ such that $\eta_n \gg \mu_n^{1/2} + \mu_n^{-1}n^{-1/2}$, we have $\|\tilde f_J - \mathbf{f}_J\|_{\mathcal{F}_J} = O_p(\eta_n)$.

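Proof Note that from the Cauchy-Schwarz inequality, we have:

$$\Big( \sum_{j \in J} d_j \|f_j\|_{\mathcal{F}_j} \Big)^2 \le \Big( \sum_{j \in J} d_j \|\mathbf{f}_j\|_{\mathcal{F}_j} \Big)\Big( \sum_{j \in J} d_j \frac{\|f_j\|_{\mathcal{F}_j}^2}{\|\mathbf{f}_j\|_{\mathcal{F}_j}} \Big),$$

with equality if and only if there exists $a > 0$ such that $\|f_j\|_{\mathcal{F}_j} = a\|\mathbf{f}_j\|_{\mathcal{F}_j}$ for all $j \in J$. We consider the unique minimizer $\bar f_J$ of the following cost function, built by replacing the regularization by its upper bound:

$$F(f_J) = \frac{1}{2}\hat\Sigma_{YY} - \langle \hat\Sigma_{X_JY}, f_J \rangle_{\mathcal{F}_J} + \frac{1}{2}\langle f_J, \hat\Sigma_{X_JX_J}f_J \rangle_{\mathcal{F}_J} + \frac{\mu_n}{2}\Big( \sum_{j \in J} d_j \|\mathbf{f}_j\|_{\mathcal{F}_j} \Big)\Big( \sum_{j \in J} d_j \frac{\|f_j\|_{\mathcal{F}_j}^2}{\|\mathbf{f}_j\|_{\mathcal{F}_j}} \Big).$$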
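Since it is a regularized least-square problem, we have (with $\varepsilon = Y - \sum_{j \in J}\mathbf{f}_j(X_j) - \mathbf{b}$):

$$\bar f_J = (\hat\Sigma_{X_JX_J} + \mu_n D)^{-1}\hat\Sigma_{X_JY} = (\hat\Sigma_{X_JX_J} + \mu_n D)^{-1}\hat\Sigma_{X_JX_J}\mathbf{f}_J + (\hat\Sigma_{X_JX_J} + \mu_n D)^{-1}\hat\Sigma_{X_J\varepsilon},$$

where $D = \big( \sum_{j \in J} d_j \|\mathbf{f}_j\| \big)\operatorname{Diag}(d_j/\|\mathbf{f}_j\|)$. Note that $D$ is upper bounded and lower bounded, as a self-adjoint operator, by strictly positive constants times the identity operator (with probability tending to one), i.e., $D_{\max} I_{\mathcal{F}_J} \succcurlyeq D \succcurlyeq D_{\min} I_{\mathcal{F}_J}$ with $D_{\min}, D_{\max} > 0$. We now prove that $\bar f_J - \mathbf{f}_J$ is converging to zero in probability. We have:

$$(\hat\Sigma_{X_JX_J} + \mu_n D)^{-1}\hat\Sigma_{X_J\varepsilon} = O_p(n^{-1/2}\mu_n^{-1}), \tag{38}$$

because of Lemma 22 and $\|(\hat\Sigma_{X_JX_J} + \mu_n D)^{-1}\|_{\mathcal{F}_J} \le D_{\min}^{-1}\mu_n^{-1}$. Moreover, similarly, we have

$$(\hat\Sigma_{X_JX_J} + \mu_n D)^{-1}\hat\Sigma_{X_JX_J}\mathbf{f}_J - (\hat\Sigma_{X_JX_J} + \mu_n D)^{-1}\Sigma_{X_JX_J}\mathbf{f}_J = O_p(n^{-1/2}\mu_n^{-1}). \tag{39}$$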
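Besides, by Lemma 23,

$$(\hat\Sigma_{X_JX_J} + \mu_n D)^{-1}\Sigma_{X_JX_J}\mathbf{f}_J - (\Sigma_{X_JX_J} + \mu_n D)^{-1}\Sigma_{X_JX_J}\mathbf{f}_J = O_p(n^{-1/2}\mu_n^{-1}).$$

Thus $\bar f_J - \mathbf{f}_J = V + O_p(n^{-1/2}\mu_n^{-1})$, where

$$V = \left[ (\Sigma_{X_JX_J} + \mu_n D)^{-1}\Sigma_{X_JX_J} - I \right]\mathbf{f}_J = -\mu_n(\Sigma_{X_JX_J} + \mu_n D)^{-1}D\,\mathbf{f}_J.$$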

We have 



\VfT, = l^l{fj,D{J:xjX,+l^nD)-^D{j):f^ 

< -C)max/^n(hj, ^XjXj {^XjXj + l^nD^nujy^ hj)^j by Lemma 

^ -C>jjjax/^n||hj||jr . 



(40) 



Finally we obtain $\|\bar f_J - \mathbf{f}_J\|_{\mathcal{F}_J} = O_p(\mu_n^{1/2} + n^{-1/2}\mu_n^{-1})$.

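We now consider the cost function defining $\tilde f_J$:

$$F_n(f_J) = \frac{1}{2}\hat\Sigma_{YY} - \langle \hat\Sigma_{X_JY}, f_J \rangle_{\mathcal{F}_J} + \frac{1}{2}\langle f_J, \hat\Sigma_{X_JX_J}f_J \rangle_{\mathcal{F}_J} + \frac{\mu_n}{2}\Big( \sum_{j \in J} d_j \|f_j\|_{\mathcal{F}_j} \Big)^2.$$

We have (note that although we seem to take infinite dimensional derivatives, everything can be done in the finite subspace spanned by the data):

$$F_n(f_J) - F(f_J) = \frac{\mu_n}{2}\left[ \Big( \sum_{j \in J} d_j \|f_j\| \Big)^2 - \Big( \sum_{j \in J} d_j \|\mathbf{f}_j\| \Big)\Big( \sum_{j \in J} d_j \frac{\|f_j\|^2}{\|\mathbf{f}_j\|} \Big) \right],$$

$$\nabla_{f_i}F_n(f_J) - \nabla_{f_i}F(f_J) = \mu_n\left[ \Big( \sum_{j \in J} d_j \|f_j\| \Big)\frac{d_i f_i}{\|f_i\|} - \Big( \sum_{j \in J} d_j \|\mathbf{f}_j\| \Big)\frac{d_i f_i}{\|\mathbf{f}_i\|} \right].$$

Since the right hand side of the previous equation corresponds to a continuously differentiable function of $f_J$ around $\mathbf{f}_J$ (with upper-bounded derivatives around $\mathbf{f}_J$) that vanishes at $f_J = \mathbf{f}_J$, and since $\nabla F(\bar f_J) = 0$, we have:

$$\|\nabla_{f_J}F_n(\bar f_J)\|_{\mathcal{F}_J} \le C\mu_n\|\bar f_J - \mathbf{f}_J\|_{\mathcal{F}_J}$$

for some constant $C > 0$. Moreover, on the ball of center $\bar f_J$ and radius $\eta_n$ such that $\eta_n \gg \mu_n^{1/2} + \mu_n^{-1}n^{-1/2}$ (to make sure that it asymptotically contains $\mathbf{f}_J$, which implies that on the ball each $f_j$, $j \in J$, is bounded away from zero) and $\eta_n \ll 1$ (so that we get consistency), we have a lower bound on the second derivative of $\big( \sum_{j \in J} d_j \|f_j\|_{\mathcal{F}_j} \big)^2$. Thus for any element of the ball,

$$F_n(f_J) \ge F_n(\bar f_J) + \langle \nabla_{f_J}F_n(\bar f_J), f_J - \bar f_J \rangle_{\mathcal{F}_J} + C'\mu_n\|f_J - \bar f_J\|_{\mathcal{F}_J}^2,$$

where $C' > 0$ is a constant. This implies that the value of $F_n(f_J)$ on the edge of the ball is larger than

$$F_n(\bar f_J) - \eta_n C\mu_n\|\bar f_J - \mathbf{f}_J\|_{\mathcal{F}_J} + C'\eta_n^2\mu_n.$$

Thus if $\eta_n^2\mu_n \gg \eta_n\mu_n^{3/2}$ and $\eta_n^2\mu_n \gg n^{-1/2}\eta_n$, then we must have all minima inside the ball of radius $\eta_n$ (because with probability tending to one, the value on the edge is greater than one value inside and the function is convex), which implies that the global minimum of $F_n$ is at most $\eta_n$ away from $\bar f_J$. Thus, since $\bar f_J$ is $O_p(\mu_n^{1/2} + n^{-1/2}\mu_n^{-1})$ away from $\mathbf{f}_J$, we have the consistency if

$$\eta_n \ll 1 \quad \text{and} \quad \eta_n \gg \mu_n^{1/2} + n^{-1/2}\mu_n^{-1},$$

which concludes the proof of the lemma. ■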

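We now prove Theorem 11. Let $\tilde f_J$ be defined as in Lemma 25. We extend it by zeros on $J^c$. We already know the squared norm consistency by Lemma 25. Since by Proposition 14 the solution is unique with probability tending to one, we need to prove that with probability tending to one $\tilde f$ is optimal for the problem in Eq. (13). We have by the first optimality condition for $\tilde f_J$:

$$\hat\Sigma_{X_JY} - \hat\Sigma_{X_JX_J}\tilde f_J = \mu_n\|\tilde f\|_d \operatorname{Diag}(d_j/\|\tilde f_j\|)\tilde f_J,$$

where we use the notation $\|f\|_d = \sum_{j=1}^m d_j\|f_j\|_{\mathcal{F}_j}$ (note the difference with $\|f\|_{\mathcal{F}} = \big( \sum_{j=1}^m \|f_j\|_{\mathcal{F}_j}^2 \big)^{1/2}$). We thus have, by solving for $\tilde f_J$ and using $\hat\Sigma_{X_JY} = \hat\Sigma_{X_JX_J}\mathbf{f}_J + \hat\Sigma_{X_J\varepsilon}$:

$$\tilde f_J = (\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}\left( \hat\Sigma_{X_JX_J}\mathbf{f}_J + \hat\Sigma_{X_J\varepsilon} \right),$$

with the notation $D_n = \|\tilde f\|_d \operatorname{Diag}(d_j/\|\tilde f_j\|_{\mathcal{F}_j})$. We can now put that back into $\hat\Sigma_{X_{J^c}Y} - \hat\Sigma_{X_{J^c}X_J}\tilde f_J$ and show that this will have small enough norm with probability tending to one. We have for all $i \in J^c$:

$$\hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde f_J = \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}\left( \hat\Sigma_{X_JX_J}\mathbf{f}_J + \hat\Sigma_{X_J\varepsilon} \right)$$
$$= -\hat\Sigma_{X_iX_J}(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}\hat\Sigma_{X_JX_J}\mathbf{f}_J + \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}\hat\Sigma_{X_J\varepsilon}$$
$$= -\hat\Sigma_{X_iX_J}\mathbf{f}_J + \hat\Sigma_{X_iX_J}(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}\mu_n D_n\mathbf{f}_J + \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}\hat\Sigma_{X_J\varepsilon}$$
$$= \hat\Sigma_{X_iX_J}(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}\mu_n D_n\mathbf{f}_J + \hat\Sigma_{X_i\varepsilon} - \hat\Sigma_{X_iX_J}(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}\hat\Sigma_{X_J\varepsilon} \tag{41}$$
$$= A_n + B_n.$$

The first term $A_n$ (divided by $\mu_n$) is equal to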

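$$\frac{A_n}{\mu_n} = \hat\Sigma_{X_iX_J}(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}D_n\mathbf{f}_J.$$

We can replace $\hat\Sigma_{X_iX_J}$ by $\Sigma_{X_iX_J}$ at cost $O_p(n^{-1/2}\mu_n^{-1/2})$ because $\langle \mathbf{f}_J, \Sigma_{X_JX_J}^{-1}\mathbf{f}_J \rangle_{\mathcal{F}_J} < \infty$ (by Lemma 24). Also, we can replace $\hat\Sigma_{X_JX_J}$ in $(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}$ by $\Sigma_{X_JX_J}$ at cost $O_p(n^{-1/2}\mu_n^{-1})$ as a consequence of Lemma 23. Those two are $o_p(1)$ by assumptions on $\mu_n$. Thus,

$$\frac{A_n}{\mu_n} = \Sigma_{X_iX_J}(\Sigma_{X_JX_J} + \mu_n D_n)^{-1}D_n\mathbf{f}_J + o_p(1).$$

Furthermore, we let $D = \|\mathbf{f}\|_d \operatorname{Diag}(d_j/\|\mathbf{f}_j\|_{\mathcal{F}_j})$. From Lemma 25, we know that $D_n - D = o_p(1)$. Thus we can replace $D_n$ by $D$ at cost $o_p(1)$ to get:

$$\frac{A_n}{\mu_n} = \Sigma_{X_iX_J}(\Sigma_{X_JX_J} + \mu_n D)^{-1}D\,\mathbf{f}_J + o_p(1) = C_n + o_p(1).$$

We now show that this last deterministic term $C_n \in \mathcal{F}_i$ converges to:

$$C = \Sigma_{X_iX_i}^{1/2}C_{X_iX_J}C_{X_JX_J}^{-1}D\,\mathbf{g}_J,$$

where, from (A7), $\forall j \in J$, $\mathbf{f}_j = \Sigma_{X_jX_j}^{1/2}\mathbf{g}_j$. We have

$$C_n - C = \Sigma_{X_iX_i}^{1/2}C_{X_iX_J}K_n D\,\mathbf{g}_J,$$

where $K_n = \operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})(\Sigma_{X_JX_J} + \mu_n D)^{-1}\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2}) - C_{X_JX_J}^{-1}$. In addition, we have:

$$\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})C_{X_JX_J}K_n = \Sigma_{X_JX_J}(\Sigma_{X_JX_J} + \mu_n D)^{-1}\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2}) - \operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})$$
$$= -\mu_n D(\Sigma_{X_JX_J} + \mu_n D)^{-1}\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2}).$$

Following Fukumizu et al. (2007), the range of the adjoint operator $\big( \Sigma_{X_iX_i}^{1/2}C_{X_iX_J} \big)^* = C_{X_JX_i}\Sigma_{X_iX_i}^{1/2}$ is included in the closure of the range of $\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})$ (which is equal to the range of $\Sigma_{X_JX_J}^{1/2}$ by Lemma 24). For any $v_J \in \mathcal{F}_J$ in the intersection of the two ranges, we have $v_J = C_{X_JX_J}\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})u_J$ (note that $C_{X_JX_J}$ is invertible), and thus

$$\langle K_n D\mathbf{g}_J, v_J \rangle_{\mathcal{F}_J} = \langle K_n D\mathbf{g}_J, C_{X_JX_J}\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})u_J \rangle_{\mathcal{F}_J} = \langle -\mu_n D(\Sigma_{X_JX_J} + \mu_n D)^{-1}\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})D\mathbf{g}_J, u_J \rangle_{\mathcal{F}_J},$$

which is $O(\mu_n^{1/2})$ and thus tends to zero. Since this holds for all elements in the intersection of the ranges, Lemma 9 by Fukumizu et al. (2007) implies that $\|C_n - C\|_{\mathcal{F}_i}$ converges to zero.

We now simply need to show that the second term $B_n$ is dominated by $\mu_n$. We have: $\|\hat\Sigma_{X_i\varepsilon}\|_{\mathcal{F}_i} = O_p(n^{-1/2})$ and $\|\hat\Sigma_{X_iX_J}(\hat\Sigma_{X_JX_J} + \mu_n D_n)^{-1}\hat\Sigma_{X_J\varepsilon}\|_{\mathcal{F}_i} \le \|\hat\Sigma_{X_i\varepsilon}\|_{\mathcal{F}_i}$; thus, since $\mu_n n^{1/2} \to +\infty$, $B_n = o_p(\mu_n)$ and therefore for each $i \in J^c$, the norm of

$$\frac{1}{d_i\mu_n\|\tilde f\|_d}\left( \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde f_J \right)$$

converges in probability to $\|C\|_{\mathcal{F}_i}/(d_i\|\mathbf{f}\|_d)$, which is strictly smaller than one because Eq. (18) is satisfied. Thus

$$P\left\{ \forall i \in J^c,\ \frac{1}{d_i\mu_n\|\tilde f\|_d}\left\| \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde f_J \right\|_{\mathcal{F}_i} \le 1 \right\}$$

is tending to 1, which implies the theorem (using the same arguments as in the proof of Theorem 2 in Appendix B.1).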

C.3 Proof of Theorem 12

Before proving the analog of the second group Lasso theorem, we need the following additional proposition, which states that consistency of the patterns can only be achieved if $\mu_n n^{1/2} \to \infty$ (even if chosen in a data dependent way).

Proposition 26 Assume (A4-7) and that $J$ is not empty. If $\hat f$ is converging in probability to $\mathbf{f}$ and $J(\hat f)$ converges in probability to $J$, then $\mu_n n^{1/2} \to \infty$ in probability.

Proof We give a proof by contradiction, and we thus assume that there exists $M > 0$ such that $\limsup_{n \to \infty} P(\mu_n n^{1/2} < M) > 0$. This imposes that there exists a subsequence which is almost surely bounded by $M$ (Durrett, 2004). Thus, we can take a further subsequence which converges to a limit $\mu_0 \in [0, \infty)$. We now consider such a subsequence (and still use the notation of the original sequence for simplicity).

With probability tending to one, we have the optimality condition (17):



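$$\hat\Sigma_{X_J\varepsilon} + \hat\Sigma_{X_JX_J}\mathbf{f}_J = \hat\Sigma_{X_JY} = \hat\Sigma_{X_JX_J}\hat f_J + \mu_n\|\hat f\|_d \operatorname{Diag}(d_j/\|\hat f_j\|_{\mathcal{F}_j})\hat f_J.$$

If we let $D_n = n^{1/2}\mu_n\|\hat f\|_d \operatorname{Diag}(d_j/\|\hat f_j\|_{\mathcal{F}_j})$, we get:

$$D_n\hat f_J = \hat\Sigma_{X_JX_J}\,n^{1/2}(\mathbf{f}_J - \hat f_J) + n^{1/2}\hat\Sigma_{X_J\varepsilon},$$

which can be approximated as follows (we denote $D = \|\mathbf{f}\|_d \operatorname{Diag}(d_j/\|\mathbf{f}_j\|_{\mathcal{F}_j})$):

$$\mu_0 D\,\mathbf{f}_J + o_p(1) = \Sigma_{X_JX_J}\,n^{1/2}(\mathbf{f}_J - \hat f_J) + o_p(1) + n^{1/2}\hat\Sigma_{X_J\varepsilon}.$$

We can now write for $i \in J^c$:

$$n^{1/2}\left( \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\hat f_J \right) = n^{1/2}\hat\Sigma_{X_i\varepsilon} + \hat\Sigma_{X_iX_J}\,n^{1/2}(\mathbf{f}_J - \hat f_J) = n^{1/2}\hat\Sigma_{X_i\varepsilon} + \Sigma_{X_iX_J}\,n^{1/2}(\mathbf{f}_J - \hat f_J) + o_p(1).$$

We now consider an arbitrary vector $w_J \in \mathcal{F}_J$ such that $\Sigma_{X_JX_J}w_J$ is different from zero (such a vector exists because $\Sigma_{X_JX_J} \neq 0$, as we have assumed that the variables are not constant). Since the range of $\Sigma_{X_JX_i}$ is included in the range of $\Sigma_{X_JX_J}$ (Baker, 1973), there exists $v_i \in \mathcal{F}_i$ such that $\Sigma_{X_JX_i}v_i = \Sigma_{X_JX_J}w_J$. Note that since $\Sigma_{X_JX_J}w_J$ is different from zero, we must have $\Sigma_{X_iX_i}^{1/2}v_i \neq 0$. We have:

$$n^{1/2}\langle v_i, \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\hat f_J \rangle_{\mathcal{F}_i} = n^{1/2}\langle v_i, \hat\Sigma_{X_i\varepsilon} \rangle_{\mathcal{F}_i} + \langle w_J, \Sigma_{X_JX_J}\,n^{1/2}(\mathbf{f}_J - \hat f_J) \rangle_{\mathcal{F}_J} + o_p(1)$$
$$= n^{1/2}\langle v_i, \hat\Sigma_{X_i\varepsilon} \rangle_{\mathcal{F}_i} + \langle w_J, \mu_0 D\,\mathbf{f}_J - n^{1/2}\hat\Sigma_{X_J\varepsilon} \rangle_{\mathcal{F}_J} + o_p(1)$$
$$= \langle w_J, \mu_0 D\,\mathbf{f}_J \rangle_{\mathcal{F}_J} + n^{1/2}\langle v_i, \hat\Sigma_{X_i\varepsilon} \rangle_{\mathcal{F}_i} - n^{1/2}\langle w_J, \hat\Sigma_{X_J\varepsilon} \rangle_{\mathcal{F}_J} + o_p(1).$$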
The random variable $E_n = n^{1/2}\langle v_i, \hat\Sigma_{X_i\varepsilon} \rangle_{\mathcal{F}_i} - n^{1/2}\langle w_J, \hat\Sigma_{X_J\varepsilon} \rangle_{\mathcal{F}_J}$ is a U-statistic with square integrable kernel obtained from i.i.d. random vectors; it is thus asymptotically normal (Van der Vaart, 1998) and we simply need to compute its mean and variance. The mean is zero and a short calculation similar to the one found in the proof of Theorem 3 in Appendix B.2 shows that we have:

= (1 - '^/n){(Tl,^n{vuT.x,X,Vi):F, - f7^in(v», 

The operator $C_{X_JX_J}^{-1}C_{X_JX_i}$ has the same range as $C_{X_JX_i}$ (because $C_{XX}$ is invertible), and is thus included in the closure of the range of $\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})$ (Baker, 1973). Thus, for any $u \in \mathcal{F}_i$, $C_{X_JX_J}^{-1}C_{X_JX_i}u$ can be expressed as a limit of terms of the form $\operatorname{Diag}(\Sigma_{X_jX_j}^{1/2})t$ where $t \in \mathcal{F}_J$. We thus have that


{u,Cxaj Diag(S^^.^pu;j)^, = {u,Cx,XjCx]xj'^XjXj Diag(S^^j^pu;j)^, 
can be expressed as a limit of terms of the form 

{t,Diag(E^J^pCxjXjDiag(E^/^^Pu;j)^j = (t, = {t,^XjX,Vi)^^ 

= {t,Diag{i:][^x^)CxjX,^xfx,^i)^j {u,Cx,XjCx]xjCxjX,^x^x,^i)T,- 

1/2 1 1/2 

This implies that Cx.Xj Biag{T.j^^x^)wj = Cx,XjCxjXj^XjX,^£x,'"i' ^e have: 

^ (^minivi, '^x,x,Vi)jr^ - ^^^^(wi, S^^^^^Cx^Xj Diag(S^J^^ ) 

= cr^in(^ii '^X,X,Vi)T^ " 0-mm("i) ^X^X,'^^>^J^x]xj^^J^.^X^X,'^»)-^« 
= 0-mm(SxiX,'^i5 {Ij'r - Cx,X:iCx]xj^XjX,)^]i%Vi)jr^. 

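By the invertibility assumption on $C_{XX}$, the operator $I_{\mathcal{F}_i} - C_{X_iX_J}C_{X_JX_J}^{-1}C_{X_JX_i}$ is lower bounded by a strictly positive constant times the identity operator, and thus, since $\Sigma_{X_iX_i}^{1/2}v_i \neq 0$, we have $E E_n^2 > 0$. This implies that $n^{1/2}\langle v_i, \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\hat f_J \rangle_{\mathcal{F}_i}$ converges to a normal distribution with strictly positive variance. Thus the probability

$$P\left( n^{1/2}\langle v_i, \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\hat f_J \rangle_{\mathcal{F}_i} \ge d_i\|\mathbf{f}\|_d\|v_i\|_{\mathcal{F}_i} + 1 \right)$$

converges to a strictly positive limit (note that $\|\mathbf{f}\|_d$ can be replaced by $\|\hat f\|_d$ without changing the result). Since $\mu_n n^{1/2} \to \mu_0 < \infty$, this implies that

$$P\left( \mu_n^{-1}\langle v_i, \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\hat f_J \rangle_{\mathcal{F}_i} > d_i\|\hat f\|_d\|v_i\|_{\mathcal{F}_i} \right)$$

is asymptotically strictly positive (i.e., has a strictly positive liminf). Thus the optimality condition (16) is not satisfied with non vanishing probability, which is a contradiction and proves the proposition. ■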



We now go back to the proof of Theorem 12. We prove by contradiction, by assuming that there exists $i \in J^c$ such that

$$\frac{1}{d_i} \left\| \Sigma_{X_i X_i}^{1/2} C_{X_i X_J} C_{X_J X_J}^{-1} \mathrm{Diag}(d_j / \|\mathbf{f}_j\|_{\mathcal{F}_j})\, \mathbf{g}_J \right\|_{\mathcal{F}_i} > 1.$$

Since $J(\hat{f}) = J$ with probability tending to one, we have, with probability tending to one, from optimality condition (17) and the usual line of arguments (see Eq. (41) in Appendix B.2) that for every $i \in J^c$:



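$$\hat{\Sigma}_{X_i Y} - \hat{\Sigma}_{X_i X_J} \hat{f}_J = \mu_n \hat{\Sigma}_{X_i X_J} (\hat{\Sigma}_{X_J X_J} + \mu_n D_n)^{-1} D_n \mathbf{f}_J + \hat{\Sigma}_{X_i \varepsilon} - \hat{\Sigma}_{X_i X_J} (\hat{\Sigma}_{X_J X_J} + \mu_n D_n)^{-1} \hat{\Sigma}_{X_J \varepsilon},$$

where $D_n = \|\hat{f}\|_d\, \mathrm{Diag}(d_j / \|\hat{f}_j\|_{\mathcal{F}_j})$. Following the same argument as in the proof of Theorem 11 (and because $\mu_n n^{1/2} \to +\infty$ as a consequence of Proposition 26), the first term in the last expression (divided by $\mu_n$) converges to

$$\Sigma_{X_i X_i}^{1/2} C_{X_i X_J} C_{X_J X_J}^{-1} \|\mathbf{f}\|_d\, \mathrm{Diag}(d_j / \|\mathbf{f}_j\|_{\mathcal{F}_j})\, \mathbf{g}_J,$$

whose norm is, by assumption, strictly greater than $d_i \|\mathbf{f}\|_d$. For the second term, we have:

$$\hat{\Sigma}_{X_i \varepsilon} - \hat{\Sigma}_{X_i X_J} (\hat{\Sigma}_{X_J X_J} + \mu_n D_n)^{-1} \hat{\Sigma}_{X_J \varepsilon}$$

$$= O_p(n^{-1/2}) - \hat{\Sigma}_{X_i X_J} \left( \hat{\Sigma}_{X_J X_J} + \mu_n \|\mathbf{f}\|_d\, \mathrm{Diag}(d_j / \|\mathbf{f}_j\|_{\mathcal{F}_j}) \right)^{-1} \hat{\Sigma}_{X_J \varepsilon} + O_p(n^{-1/2}).$$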

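The remaining term can be bounded as follows (with $D = \|\mathbf{f}\|_d\, \mathrm{Diag}(d_j / \|\mathbf{f}_j\|_{\mathcal{F}_j})$):

$$E\left( \left\| \hat{\Sigma}_{X_i X_J} (\hat{\Sigma}_{X_J X_J} + \mu_n D)^{-1} \hat{\Sigma}_{X_J \varepsilon} \right\|_{\mathcal{F}_i}^2 \,\Big|\, \bar{X} \right) \leq \frac{\sigma_{\max}^2}{n}\, \mathrm{tr}\, \hat{\Sigma}_{X_i X_J} (\hat{\Sigma}_{X_J X_J} + \mu_n D)^{-1} \hat{\Sigma}_{X_J X_J} (\hat{\Sigma}_{X_J X_J} + \mu_n D)^{-1} \hat{\Sigma}_{X_J X_i},$$

which implies that the full expectation is $O(n^{-1})$ (because our operators are trace-class, i.e., have finite trace). Thus the remaining term is $O_p(n^{-1/2})$ and thus negligible compared to $\mu_n$; therefore $\frac{1}{\mu_n \|\hat{f}\|_d} (\hat{\Sigma}_{X_i Y} - \hat{\Sigma}_{X_i X_J} \hat{f}_J)$ converges in probability to a limit which is of norm strictly greater than $d_i$. Thus there is a non vanishing probability of being strictly larger than $d_i$, which implies that with non vanishing probability, the optimality condition (16) is not satisfied, which is a contradiction. This concludes the proof.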

C.4 Proof of Proposition [Is] 

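Note that the estimator defined in Eq. (23) is exactly equal to

$$\hat{\Sigma}_{X_i X_J} (\hat{\Sigma}_{X_J X_J} + \kappa_n I)^{-1} \mathrm{Diag}(d_j / \|(\hat{f}^{LS}_{\kappa_n})_j\|_{\mathcal{F}_j})\, (\hat{f}^{LS}_{\kappa_n})_J.$$

Using Proposition 17 and the arguments from Appendix C.2, replacing $\hat{f}$ by $\hat{f}^{LS}_{\kappa_n}$, we get the consistency result.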
Appendix D. Proof of Results on Adaptive Group Lasso
D.1 Proof of Theorem 16

We define $\tilde{w}$ as the minimizer of the same cost function restricted to $w_{J^c} = 0$. Because $\hat{w}^{LS}$ is consistent, the norms of $\hat{w}^{LS}_j$ for $j \in J$ are bounded away from zero, and we get from standard results on M-estimation (Van der Vaart, 1998) the normal limit distribution with given covariance matrix if $\mu_n \ll n^{-1/2}$.

Moreover, the pattern of zeros of $\tilde{w}$ converges in probability (this is obvious by construction of $\tilde{w}$). What remains to be shown is that with probability tending to one, $\tilde{w}$ is optimal for the full problem. We just need to show that with probability tending to one, for all $i \in J^c$,

$$\| \hat{\Sigma}_{X_i \varepsilon} - \hat{\Sigma}_{X_i X_J} (\tilde{w}_J - \mathbf{w}_J) \| \leq \mu_n \|\tilde{w}\|_d\, \|\hat{w}^{LS}_i\|^{-\gamma}. \quad (42)$$

Note that $\|\tilde{w}\|_d$ converges in probability to $\|\mathbf{w}\|_d > 0$. Moreover, $\|\tilde{w}_J - \mathbf{w}_J\| = O_p(n^{-1/2})$. Thus, if $i \in J^c$, i.e., if $\mathbf{w}_i = 0$, then $\|\hat{w}^{LS}_i\| = O_p(n^{-1/2})$. The left hand side in Eq. (42) is thus upper bounded by $O_p(n^{-1/2})$ while the right hand side is lower bounded asymptotically by $\mu_n n^{\gamma/2}$. Thus if $n^{-1/2} = o(\mu_n n^{\gamma/2})$, then with probability tending to one we get the correct optimality condition, which concludes the proof.
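
The final comparison of rates reduces to exponent arithmetic. The sketch below assumes a polynomial schedule $\mu_n = n^{-a}$ (the exponent `a` and the function name are illustrative and do not come from the text) and tests whether $n^{-1/2} = o(\mu_n n^{\gamma/2})$, i.e., whether $-1/2 < -a + \gamma/2$.

```python
from fractions import Fraction

def lhs_is_negligible(a, gamma):
    """True if n^(-1/2) = o(mu_n * n^(gamma/2)) when mu_n = n^(-a).

    Assumption (not from the text): mu_n decays polynomially, mu_n = n^(-a).
    The comparison then reduces to the exponent inequality -1/2 < -a + gamma/2.
    """
    a, gamma = Fraction(a), Fraction(gamma)
    return Fraction(-1, 2) < -a + gamma / 2

# With gamma = 1 the inequality holds exactly for a < 1.
print(lhs_is_negligible(Fraction(3, 4), 1))  # True
print(lhs_is_negligible(Fraction(5, 4), 1))  # False
```

Under this assumed schedule, larger values of $\gamma$ widen the range of admissible decay exponents.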



D.2 Proof of Proposition 17

We have: 

and thus: 



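$$= (\Sigma_{XX} + \kappa_n I)^{-1} \Sigma_{XX} \mathbf{f} - \mathbf{f} + O_p(n^{-1/2} \kappa_n^{-1}) \quad \text{from Lemma 23.}$$

Since $\mathbf{f} = \Sigma_{XX}^{1/2} \mathbf{g}$, we have $\| -(\Sigma_{XX} + \kappa_n I)^{-1} \kappa_n \mathbf{f} \|_{\mathcal{F}}^2 \leq C \kappa_n \|\mathbf{g}\|_{\mathcal{F}}^2$, which concludes the proof.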
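
The bias bound above can be checked numerically in the eigenbasis of $\Sigma_{XX}$, where it becomes a scalar inequality: $\sup_{\lambda \geq 0} \kappa \lambda^{1/2} / (\lambda + \kappa) = \kappa^{1/2}/2$, so $C = 1/4$ suffices. The following is a minimal sketch; the spectrum and the coordinates of $\mathbf{g}$ are hypothetical values chosen only for illustration.

```python
def bias_sq(eigvals, g, kappa):
    """|| kappa (Sigma + kappa I)^{-1} Sigma^{1/2} g ||^2, with Sigma diagonal.

    eigvals: eigenvalues of Sigma; g: coordinates of g in the same eigenbasis.
    """
    return sum((kappa * lam ** 0.5 / (lam + kappa) * gi) ** 2
               for lam, gi in zip(eigvals, g))

# Hypothetical summable spectrum and coefficients (assumptions, not from the text).
eigvals = [2.0 ** (-k) for k in range(30)]
g = [1.0 / (k + 1) for k in range(30)]
g_sq = sum(gi ** 2 for gi in g)

for kappa in (1e-1, 1e-2, 1e-3):
    bound = 0.25 * kappa * g_sq  # C * kappa * ||g||^2 with C = 1/4
    assert bias_sq(eigvals, g, kappa) <= bound
    print(kappa, bias_sq(eigvals, g, kappa), bound)
```

The squared bias therefore decays at least linearly in $\kappa_n$, which is the rate used in the argument.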

D.3 Proof of Theorem 18

We define $\tilde{f}$ as the minimizer of the same cost function restricted to $f_{J^c} = 0$. Because $\hat{f}^{LS}_{n^{-1/3}}$ is consistent, the norms of $(\hat{f}^{LS}_{n^{-1/3}})_j$ for $j \in J$ are bounded away from zero, and Lemma 25 applies with $\mu_n = \mu_0 n^{-1/3}$, i.e., $\tilde{f}$ converges in probability to $\mathbf{f}$, and so does its pattern of zeros (which is obvious by construction of $\tilde{f}$). Moreover, for any $\eta > 0$, from Lemma 25 we have $\|\tilde{f}_J - \mathbf{f}_J\| = O_p(n^{-1/6+\eta})$ (because $\mu_n^{1/2} + n^{-1/2} \mu_n^{-1} = O(n^{-1/6})$).

What remains to be shown is that with probability tending to one, $\tilde{f}$ is optimal for the full problem. We just need to show that with probability tending to one, for all $i \in J^c$,

$$\| \hat{\Sigma}_{X_i \varepsilon} - \hat{\Sigma}_{X_i X_J} (\tilde{f}_J - \mathbf{f}_J) \|_{\mathcal{F}_i} \leq \mu_n \|\tilde{f}\|_d\, \|(\hat{f}^{LS}_{n^{-1/3}})_i\|_{\mathcal{F}_i}^{-\gamma}. \quad (43)$$

Note that $\|\tilde{f}\|_d$ converges in probability to $\|\mathbf{f}\|_d > 0$. Moreover, by Proposition 17, $\|(\hat{f}^{LS}_{n^{-1/3}})_i - \mathbf{f}_i\| = O_p(n^{-1/6})$. Thus, if $i \in J^c$, i.e., if $\mathbf{f}_i = 0$, then $\|(\hat{f}^{LS}_{n^{-1/3}})_i\|_{\mathcal{F}_i} = O_p(n^{-1/6})$. The left hand side in Eq. (43) is thus upper bounded by $O_p(n^{-1/2} + n^{-1/6+\eta})$ while the right hand side is lower bounded asymptotically by $n^{-1/3} n^{\gamma/6}$. Thus if $-1/6 + \eta < -1/3 + \gamma/6$, then with probability tending to one we get the correct optimality condition. As soon as $\gamma > 1$, we can find $\eta$ small enough and strictly positive, which concludes the proof.
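
The closing step is an exponent comparison: a strictly positive $\eta$ with $-1/6 + \eta < -1/3 + \gamma/6$ exists exactly when $\gamma > 1$. A small sketch with exact rational arithmetic follows; the helper name and the choice of returning half of the available slack are illustrative assumptions.

```python
from fractions import Fraction

def admissible_eta(gamma):
    """Return some eta > 0 with -1/6 + eta < -1/3 + gamma/6, or None if none exists."""
    gamma = Fraction(gamma)
    # Available slack: (-1/3 + gamma/6) - (-1/6) = (gamma - 1) / 6.
    slack = Fraction(-1, 3) + gamma / 6 + Fraction(1, 6)
    return slack / 2 if slack > 0 else None

print(admissible_eta(2))  # 1/12
print(admissible_eta(1))  # None
```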

D.4 Range Condition of Covariance Operators 

We denote by $C(q)$ the convolution operator by $q$ on the space of real functions on $\mathbb{R}^p$ and by $T(p_X)$ the pointwise multiplication by $p_X(x)$. In this appendix, we look at different Hilbertian products of functions on $\mathbb{R}^p$; we use the notations $\langle \cdot, \cdot \rangle_{\mathcal{F}}$, $\langle \cdot, \cdot \rangle_{L^2(p_X)}$ and $\langle \cdot, \cdot \rangle_{L^2(\mathbb{R}^p)}$ for the dot products in the RKHS $\mathcal{F}$, the space $L^2(p_X)$ of square integrable functions with respect to $p_X(x)dx$, and the space $L^2(\mathbb{R}^p)$ of square integrable functions with respect to the Lebesgue measure. With our assumptions, for all $f, g \in L^2(\mathbb{R}^p)$, we have:

$$\langle f, g \rangle_{L^2(\mathbb{R}^p)} = \langle C(q)^{1/2} f, C(q)^{1/2} g \rangle_{\mathcal{F}}.$$



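Denote by $\{\lambda_k\}_{k \geq 1}$ and $\{e_k\}_{k \geq 1}$ the positive eigenvalues and the eigenvectors of the covariance operator $\Sigma_{XX}$, respectively. Note that since $p_X(x)$ was assumed to be strictly positive, all eigenvalues are strictly positive (the RKHS cannot contain any non zero constant functions on $\mathbb{R}^p$). For $k \geq 1$, set $f_k = \lambda_k^{-1/2} \left( e_k - \int_{\mathbb{R}^p} e_k(x) p_X(x) dx \right)$. By construction, for any $k, \ell \geq 1$,

$$\lambda_k \delta_{k,\ell} = \langle e_k, \Sigma_{XX} e_\ell \rangle_{\mathcal{F}} = \int_{\mathbb{R}^p} p_X(x) \left( e_k - \int_{\mathbb{R}^p} e_k(x) p_X(x) dx \right) \left( e_\ell - \int_{\mathbb{R}^p} e_\ell(x) p_X(x) dx \right) dx$$

$$= \lambda_k^{1/2} \lambda_\ell^{1/2} \int_{\mathbb{R}^p} p_X(x) f_k(x) f_\ell(x) dx = \lambda_k^{1/2} \lambda_\ell^{1/2} \langle f_k, f_\ell \rangle_{L^2(p_X)}.$$

Thus $\{f_k\}_{k \geq 1}$ is an orthonormal sequence in $L^2(p_X)$. Let $f = C(q) g$ for $g \in L^2(\mathbb{R}^p)$ such that $\int_{\mathbb{R}^p} g(x) dx = 0$. Note that $f$ is in the range of $\Sigma_{XX}^{1/2}$ if and only if $\langle f, \Sigma_{XX}^{-1} f \rangle_{\mathcal{F}}$ is finite. We have:

$$\langle f, \Sigma_{XX}^{-1} f \rangle_{\mathcal{F}} = \sum_{p=1}^{\infty} \lambda_p^{-1} \langle e_p, f \rangle_{\mathcal{F}}^2 = \sum_{p=1}^{\infty} \lambda_p^{-1} \langle e_p, g \rangle_{L^2(\mathbb{R}^p)}^2 = \sum_{p=1}^{\infty} \lambda_p^{-1} \left( \int_{\mathbb{R}^p} g(x) e_p(x) dx \right)^2$$

$$= \sum_{p=1}^{\infty} \langle p_X^{-1} g, f_p \rangle_{L^2(p_X)}^2 \leq \| p_X^{-1} g \|_{L^2(p_X)}^2 = \int_{\mathbb{R}^p} \frac{g^2(x)}{p_X(x)} dx,$$

because $\{f_k\}_{k \geq 1}$ is an orthonormal sequence in $L^2(p_X)$. This concludes the proof.
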
Appendix E. Gaussian Kernels and Gaussian Variables 

In this section, we consider $X \in \mathbb{R}^m$ with normal distribution with zero mean and covariance matrix $\Sigma$. We also consider Gaussian kernels $k_j(x_j, x_j') = \exp(-b_j (x_j - x_j')^2)$ on each of its components. In this situation, we can find an orthonormal basis of the Hilbert spaces $\mathcal{F}_j$ in which we can compute the coordinates of all covariance operators. This allows us to check conditions (18) or (19) without using sampling.

We consider the eigenbasis of the non centered covariance operators on each $\mathcal{F}_j$, $j = 1, \dots, m$, which is equal to (Zhu et al., 1998):

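$$e_k^j(x_j) = (\lambda_k^j)^{1/2} \left( \frac{c_j^{1/2}}{a_j^{1/2}\, 2^k k!} \right)^{1/2} e^{-(c_j - a_j) x_j^2} H_k\left( (2 c_j)^{1/2} x_j \right)$$

with eigenvalues $\lambda_k^j = \left( \frac{2 a_j}{A_j} \right)^{1/2} (B_j)^k$, where $a_j = 1/(4 \Sigma_{jj})$, $c_j = (a_j^2 + 2 a_j b_j)^{1/2}$, $A_j = a_j + b_j + c_j$ and $B_j = b_j / A_j$, and $H_k$ is the $k$-th Hermite polynomial.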
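
As a sanity check on these constants, the sketch below evaluates the eigenvalues $\lambda_k = (2a/A)^{1/2} B^k$ for a single component, assuming the index $k$ starts at 0 (as for Hermite polynomials). Since the Gaussian kernel satisfies $k(x, x) = 1$, the eigenvalues of the non centered covariance operator should sum to $E\, k(X, X) = 1$. The function name and the numerical values are illustrative.

```python
import math

def gaussian_kernel_eigenvalues(sigma_jj, b, n_terms):
    """Eigenvalues lambda_k = (2a/A)^{1/2} B^k for k = 0, ..., n_terms - 1.

    sigma_jj: variance of the Gaussian component; b: kernel parameter in
    exp(-b (x - x')^2). Assumes the indexing starts at k = 0.
    """
    a = 1.0 / (4.0 * sigma_jj)
    c = math.sqrt(a * a + 2.0 * a * b)
    A = a + b + c
    B = b / A
    return [math.sqrt(2.0 * a / A) * B ** k for k in range(n_terms)]

lams = gaussian_kernel_eigenvalues(sigma_jj=1.0, b=0.5, n_terms=200)
print(lams[:3])
print(sum(lams))  # close to 1, the trace of the operator when k(x, x) = 1
```

The geometric decay $B_j^k$ with $B_j < 1$ means that a moderate truncation level already captures almost all of the trace.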

We can then compute all required expectations as follows (note that by definition we have 



MkiXj) = ( 



' {aj + Cj) \k 



2k 



1/2 



2{cj + ttj) 



( 1/2 1/2 



1/2 



aj'^aj'^2^2^k\i\ ) 



DkeiQij), 



where Qi 



i(l - ai/c) 

|(l-a,/c,) 



+ i 



o. 1/2 1/2 



C.. 1/2 1/2 



and 



ki 



{Q)= [ 



exp 



Q 



H}^{u)H^{v)dudv^ 



for any positive matrix $Q$. For any given $Q$, $D_{k\ell}(Q)$ can be computed exactly by using a singular value decomposition of $Q$ and the appropriate change of variables.